Thanks for the feedback! Can I ask three questions?
* Can you paste the output of "git show" into a reply? Just want to
check what commit you have checked out and built.
* Do the documents you have loaded contain Unicode characters?
* Are you seeing the error in the User Interface or in the JSON output
(assuming you have a recent build with JSON output)?
I'm guessing you might have an old version which doesn't include this fix:
There was an error where offsets were incorrect if documents
contained Unicode characters, and this tended to worsen the further
down the document you went, because each multi-byte rune added
an offset error of between 1 and 3 characters. You should be
able to correct this with a simple "git pull" and "make".
In the long run I'd prefer to use this library:
for dealing with the situation a bit more cleanly.
Can I be nosy and ask what your use case is? Very interested in
people's applications of SFM :)
Cheers.
Donny.
> I don't think we installed it using git. Is there some other way I
> can check the version? One of our sys admins is installing the
> latest, which you pointed to. Our documents are all plain text with
> no Unicode characters. We're just examining the html output, which
> shows the To and From columns. The numbers there don't correspond to
> the character counts of the positions.
I think if you're still seeing HTML you probably have quite an old
version :) Give the latest a try, you might have to reload the
documents along with a -reset flag on the command line. It would be
really useful to see a screenshot of the error if it occurs with the
latest build. The current UI highlights the text as you scroll over
each fragment.
Also, after associating your corpus, have a look at the JSON output
from http://127.0.0.1:8080/document/12/99/, substituting 12 with
your doctype and 99 with your docid. If the fragments are
incorrect, I'd be very interested to see the two texts that are
associated, if possible. Are you using a public domain corpus?
> Actually, SFM is going to be extremely useful for us. I'm an English
> grad student at UVA. We're incorporating SFM into Juxta, which is a
> text collation tool suite. The problem with Juxta is that it can't
> find far-distance differences between texts. With SFM we can find the
> far-distance similarities and, by inversion, the differences. In the
> future we might ask you guys to help us completely integrate SFM with
> Juxta. At the moment, I'm just taking the results of SFM and
> uploading them to Juxta through an API.
Sounds very interesting! Hopefully the JSON output will help you speed
up the Juxta import process. If you want to specify an SFM export
format that works well with Juxta I'd be more than happy to bolt it on
to the API :) All the JSON output is created using Google ctemplate:
http://code.google.com/p/ctemplate/
so it's relatively easy to add alternative rendering in forms like
XML/SQL/GraphViz ...
Must admit I'm very curious about your corpus and research! Have you
tried searching long texts against themselves? It's really interesting
to pick out the repetitions in fiction, for instance "The gentleman in
the white waistcoat" in Oliver Twist. Another idea is comparing the
works of a single author. With Dickens, you can definitely see turns
of phrase being reused between books. In some cases, you can almost
tell which of his other books he might have been re-reading as he
wrote a chapter in another, the clusters are that apparent!
Anyway, let me know how you get on.
Cheers,
Donny.
> Is there any way you could work with that? Can I see the JSON output
> somewhere other than the browser?
I've posted an example of the output from a corpus I'm working with:
https://gist.github.com/1671797
This can be viewed in a browser by visiting:
http://127.0.0.1:8080/document/5/1
or viewed on the command line using a tool like curl:
curl http://127.0.0.1:8080/document/5/1
or accessed using any HTTP client such as urllib in Python or
equivalent in your favourite programming language.
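For instance, a minimal Python 3 sketch using urllib.request (the helper names are my own; the address and document path are the defaults mentioned above):

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:8080"  # default SFM address used in this thread

def document_url(doctype, docid):
    # Build the JSON endpoint URL for a given doctype and docid
    return "%s/document/%d/%d" % (BASE_URL, doctype, docid)

def fetch_document(doctype, docid):
    # Fetch and decode the JSON output for one document
    with urllib.request.urlopen(document_url(doctype, docid)) as response:
        return json.load(response)

# e.g. fetch_document(5, 1) returns the same JSON that curl prints
print(document_url(5, 1))
```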
> We really just need an array of From and To positions, and lengths.
The interesting data for you is the fragments key for each matching
document which contains an array of arrays, eg:
"fragments" : [[410,2732,26,984972577],[423,2745,39,2859269951],[3154,6843,34,3835548155]],
Each entry is a match. Each match is an array of [from position,to
position,length,hash] where the hash is an aid to grouping identical
matches without doing a substring operation.
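As a quick Python sketch, here's how you might unpack the example above into named fields and group by hash (the field names are just my own labels):

```python
import json
from collections import defaultdict

# The "fragments" value from the example above
payload = ('{"fragments" : [[410,2732,26,984972577],'
           '[423,2745,39,2859269951],[3154,6843,34,3835548155]]}')

doc = json.loads(payload)

# Each entry is [from position, to position, length, hash]
matches = [
    {"from": f, "to": t, "length": length, "hash": h}
    for f, t, length, h in doc["fragments"]
]

# Group identical matches by hash, with no substring work needed
by_hash = defaultdict(list)
for m in matches:
    by_hash[m["hash"]].append(m)

for m in matches:
    print("from=%d to=%d length=%d" % (m["from"], m["to"], m["length"]))
```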
You might have to rewrite your importer to cope with this format, but
it is likely to be fairly stable now. The previous HTML output was
just a stopgap prior to this.
> Also, could we correspond via email? I forget to check the forum here
> sometimes.
Sure, my email is donov...@gmail.com, but it probably makes sense to
CC the mailing list on stuff that is of general interest.
Cheers,
Donny.
I've updated your subscription, so you should get this in your inbox :)
> Have a look at our sample here: http://chandra.village.virginia.edu:8000/
Great to see the UI load from the other side of the Atlantic!
> You'll notice that SFM breaks up the prose at the beginning of each
> document. It should include more text in each match, breaking only at
> the hyphens that break up words. See what I mean?
The Fragments window orders by from position by default, i.e. in
document order. To get the long matches first click the "Length" grid
header twice. The long matches are truncated and have an ellipsis mark
at the end, hover over them and an orange highlight should show the
full match.
The shorter matches seem to be a bit too short because the whitespace
threshold is set too low. With a setting of 0.5, if half of a
window is non-alphanumeric, then the match will stop. Try a higher
setting up to a limit of 1.0 to see if that helps. It's useful as a
filter when documents have lots of useless formatting like "* * * *"
chapter breaks.
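Here's a rough Python sketch of the idea (not SFM's actual code, just an illustration of how the threshold filters a window):

```python
def non_alnum_fraction(window):
    # Fraction of characters in a window that are non-alphanumeric
    return sum(1 for c in window if not c.isalnum()) / len(window)

def window_passes(window, threshold=0.5):
    # A match stops once the non-alphanumeric fraction of the
    # window reaches the threshold
    return non_alnum_fraction(window) < threshold

print(window_passes("The gentleman in"))  # mostly letters: passes
print(window_passes("* * * * * * * * "))  # all punctuation/spaces: rejected
```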
Let me know if this works!
Cheers,
Donny.
https://github.com/mediastandardstrust/superfastmatch/blob/master/src/registry.cc#L31-32
The command line parameter is "white_space_threshold", e.g.:
superfastmatch -white_space_threshold 0.8
Differing linebreaks between the two editions do seem to be the
problem! If it's OK, can you send me the two untouched text files
as attachments? Did you load them using the load.sh example script?
A carriage return should be counted as a single whitespace and
therefore be equivalent to a straightforward space. I'm a bit
confused, so the original documents will help me solve it.
Cheers,
Donny.
The latest commits (I hope!) contain a fix for the line ending problem
you were having. Line breaks were only being normalised for the
purposes of counting the amount of whitespace in a window and not for
calculating the actual hash. I've fixed this by clamping all ASCII
codes below 47 to a single value of 47. This is UTF-8 compatible and
seems to have improved things a lot in the sense that fewer, but
longer fragments are found.
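In Python terms, the normalisation is roughly this (a sketch of the idea, not the actual C++ — 47 is the code for '/'):

```python
def normalise(text):
    # Clamp every byte below 47 to 47, so spaces, tabs, CR and LF
    # all hash identically; UTF-8 multi-byte sequences use bytes
    # >= 0x80, so they pass through untouched
    return bytes(max(b, 47) for b in text.encode("utf-8"))

a = normalise("word one\nword two")
b = normalise("word one word two")
print(a == b)  # the line break and the space now hash the same
```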
Not sure I can easily deal with the added hyphenation that your two
examples have as this adds an extra character, but maybe the layout
decisions are interesting in themselves :)
Let me know if it works ok for you.
Cheers,
Donny.
> Search for: "ton cri d'oiseau surpris et de fascine" in the right-hand
> text. And click on the selection to align the left-hand text properly.
> Notice how the two matching lines following this line are broken up.
It looks like Juxta might well be trimming trailing and leading whitespace
before processing. You can do something similar using sed:
sed 's/^[ \t]*//;s/[ \t]*$//' chiens_2.txt > chiens_2.stripped.txt
or see:
http://www.cyberciti.biz/tips/delete-leading-spaces-from-front-of-each-word.html
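If sed isn't convenient, an equivalent Python sketch (mirroring the sed expression above):

```python
def strip_lines(text):
    # Remove leading and trailing spaces/tabs from every line,
    # the same effect as: sed 's/^[ \t]*//;s/[ \t]*$//'
    return "\n".join(line.strip(" \t") for line in text.split("\n"))

sample = "  ton cri \t\n\td'oiseau  "
print(strip_lines(sample))
```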
I could do this in SFM itself, but it would be hard to calculate the
positions for the untouched original document.
> Is there a way to display the SFM results on a scrolling page, so that I can search through them easily?
I'm currently working on a side-by-side view, much like Juxta's,
which will greatly assist in seeing matched fragments in the context
of both documents. Will let you know when it's ready for a test!
Hope that helps,
Donny.
I didn't realise that Juxta was showing SFM matches! It is a really
nice side-by-side view. Have you got a contact at Performant for the
developer behind the jQuery code?
I think the splits in the match are due to the fact that the leading
whitespace lengths are different between the two documents. If you
normalise both documents before loading them into SFM, by running them
through the sed snippet first, then the match should be longer and
correct. In other words, the error is in the source document which is
then reflected in both SFM and Juxta.
Cheers,
Donny.