Query about Semafor tool

M. Solaimani

Mar 4, 2016, 5:49:12 PM

to semafor-users

Hi,

I am currently using SEMAFOR for one of my research projects and have found it an interesting and excellent tool. I downloaded the source code and ran it. When I cross-checked the output against the web demo (http://demo.ark.cs.cmu.edu/parse), I found that the JSON results differ.

My input (taken from a news article):

-- Adds detail , context , grafs 3 , 5 to end ; adds highlights , updates byline -LRB- CNN -RRB- -- A provincial council candidate and nine of his supporters were killed by the Taliban in Afghanistan two days after they were kidnapped , said Sakhidad Haidari , deputy police chief of northern Sar-e-Pul province .

I have a couple of questions:

1. The JSON from the web demo contains the tokens, but they are missing from the output of the GitHub build. Is there any way to produce the same output?

2. The Victim fields are quite different. How does the software calculate the start and end index of a frame's label? I used the character index in the GitHub build's output and found the following Victim spans, which differ from the web demo's.

GitHub build:

Victim: Sakhidad Haidari , deputy police chief of northern Sar-e-Pul province

Victim: they

Web:

Victim: "A provincial council candidate and nine of his supporters"

Victim: they

I saw that the web demo uses token indices (not character indices) for start and end. But because the start/end indices in the GitHub build's JSON output file are different, I interpreted them as character indices, which seemed more logical to me after observing the output.
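To make the token-index versus character-index distinction concrete, here is a minimal Python sketch. It is not SEMAFOR code: the helper name `token_span_to_char_span` is hypothetical, and it assumes end-exclusive spans over a sentence whose tokens are separated by single spaces. It shows how the same pair of numbers selects very different text depending on which kind of index it is taken to be.

```python
def token_span_to_char_span(tokens, start, end):
    """Map an end-exclusive token span to end-exclusive character offsets.

    Assumes the sentence is the tokens joined by single spaces
    (an assumption for this sketch, not a documented SEMAFOR convention).
    """
    if not 0 <= start < end <= len(tokens):
        raise ValueError("invalid token span")
    # Every token before the span contributes its length plus one space.
    c_start = sum(len(t) + 1 for t in tokens[:start])
    c_end = c_start + len(" ".join(tokens[start:end]))
    return c_start, c_end


sentence = ("A provincial council candidate and nine of his supporters "
            "were killed by the Taliban")
tokens = sentence.split(" ")

# Reading (0, 9) as token indices selects the first nine tokens.
as_tokens = " ".join(tokens[0:9])

# Reading the same numbers as character indices selects nine characters.
as_chars = sentence[0:9]

# Converting the token span to character offsets recovers the same text.
c_start, c_end = token_span_to_char_span(tokens, 0, 9)
converted = sentence[c_start:c_end]

print(repr(as_tokens))
print(repr(as_chars))
print(repr(converted))
```

If the two outputs being compared use different index types, converting one of them this way before comparing the labelled text avoids spurious mismatches.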

With best regards,

M Solaimani

Research Assistant

University of Texas at Dallas

Sam Thomson

Mar 5, 2016, 5:26:10 PM

to semafor-users

Hi M Solaimani,

The version of SEMAFOR we use in the web demo is https://github.com/Noahs-ARK/semafor/tree/v3.0-alpha-04. It uses a model trained on MaltParser-produced (Stanford basic) dependencies, available here: http://www.ark.cs.cmu.edu/SEMAFOR/semafor_malt_model_20121129.tar.gz. Since we're showing off TurboParser in the demo, we use 3rd-order-TurboParser--produced (Stanford basic) dependencies in the demo, with nearly identical performance.

The software is machine-learned, so there is not a succinct way to describe how spans are selected, other than that it calculates features of spans and tries to weight those features in a way that produces output that agrees with how humans have annotated spans.
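As a rough illustration of "calculate features of spans and weight them", here is a toy Python sketch of linear span scoring. Everything in it is an assumption made for illustration: the feature names, the hand-set weights, and the helper functions are hypothetical and are not SEMAFOR's actual features or model. In a real system the weights would be learned from human-annotated spans rather than written by hand.

```python
AUXILIARIES = {"was", "were", "is", "are"}
DETERMINERS = {"a", "an", "the"}


def span_features(tokens, start, end, target_index):
    """Hypothetical features of the candidate span tokens[start:end]."""
    words = [t.lower() for t in tokens[start:end]]
    return {
        "length": float(end - start),
        "before_target": 1.0 if end <= target_index else 0.0,
        "starts_with_determiner": 1.0 if words[0] in DETERMINERS else 0.0,
        "contains_auxiliary": 1.0 if AUXILIARIES & set(words) else 0.0,
    }


def score(features, weights):
    """Linear score: weighted sum of the feature values."""
    return sum(weights.get(name, 0.0) * value for name, value in features.items())


def best_span(tokens, target_index, weights, max_len=10):
    """Return the highest-scoring span that does not contain the target word."""
    candidates = [
        (s, e)
        for s in range(len(tokens))
        for e in range(s + 1, min(len(tokens), s + max_len) + 1)
        if not (s <= target_index < e)
    ]
    return max(
        candidates,
        key=lambda se: score(span_features(tokens, se[0], se[1], target_index), weights),
    )


tokens = "A candidate was killed".split()
# Hand-set weights for this sketch; a trained model would learn these.
weights = {
    "length": 0.1,
    "before_target": 0.5,
    "starts_with_determiner": 1.0,
    "contains_auxiliary": -2.0,
}
start, end = best_span(tokens, target_index=3, weights=weights)
print(" ".join(tokens[start:end]))
```

Because the selected span depends on the feature values, and those depend on the upstream parse, two pipelines that use different dependency parsers can pick different spans for the same sentence.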

I'm a little confused... are you saying the char offsets in the github version disagree with the token offsets in the github version? When you say the github version, which version are you using specifically?

Cheers,

-Sam

Gmosy Gnaq

Apr 11, 2017, 5:58:40 AM

to semafor-users

Dear Sam Thomson,

I have downloaded version 3.0 alpha 4 of the software and tested it several times. The difference between the results from the online demo and from my copy is obvious, as has already been mentioned here. Please check this one and this one. The former is from the online demo and the latter is mine. Please pay attention to the labels for "left". My result reports it as "the direction left", which is clearly wrong, but the online demo reports it as "past form of the verb 'leave'", which is right.

Is there anything that explains the difference? I would say it is slightly difficult to believe that the performance of the two parsers is 'nearly identical'. Is there any way for local copies to produce the same result as the online demo?