Hi Siva,
thanks for raising this point. Of course, we will describe in detail
how this data was created in a paper, which will be included in the
workshop volume.
Since you'd like to know now, I enclose the instructions the workers
were given:
-----
How literal is this phrase?
Can you infer the meaning of a given phrase by considering only its
parts literally, or does the phrase carry a 'special' meaning?
In the context below, how literal is the meaning of the phrase in
bold?
Enter a number between 0 and 10.
    * 0 means: this phrase is not to be understood literally at all.
    * 10 means: this phrase is to be understood very literally.
    * Use values in between to grade your decision. Please, however,
try to take a stand as often as possible.
In case the context is unclear or nonsensical, please enter "66" and
use the comment field to explain. However, please try to make sense of
it even if the sentences are incomplete.
Example 1:
There was a red truck parked curbside.  It looked like someone was
living in it.
YOUR ANSWER: 10
reason: the phrase describes a truck whose color is red; this can be
inferred from the parts "red" and "truck" alone, without any special
knowledge.
Example 2:
What a tour! We were on cloud nine when we got back to headquarters
but we kept our mouths shut.
YOUR ANSWER: 0
reason: "cloud nine" means to be blissfully happy. It does NOT refer
to a cloud with the number nine.
Example 3:
Yellow fever is found only in parts of South America and Africa.
YOUR ANSWER: 7
reason: "yellow fever" refers to a disease causing high body
temperature. However, the fever itself is not yellow. Overall, this
phrase is fairly literal, but not totally, hence answering with a
value between 5 and 8 is appropriate.
We take rejection seriously and will not reject a HIT unless it was
done carelessly. Entering anything other than a number between 0 and
10 in the judgment field will automatically trigger rejection.
YOUR  CONTEXT with <<TARGET>>
<<SENTENCE WITH TARGET BOLDED>>
----------
Workers are selected via a test batch. Good workers are allowed to
work on the real data. Consistency checking is done manually.
This is a subjective task and we do not expect people to always agree.
The final scores are averaged, using 'the wisdom of the crowds'.
Please note that on MTurk you cannot give workers complicated
instructions and expect them to follow them. Rather, recruit more
workers and take the average.
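The aggregation described above can be sketched in a few lines of Python. This is only an illustration, not the organizers' actual code: it averages each item's 0-10 literality judgments across workers, dropping the "66" sentinel used for unclear contexts, and the item names are hypothetical.

```python
def aggregate(judgments):
    """Map item id -> mean of that item's valid worker scores (0-10).

    Judgments outside 0-10 (e.g. the "66" code for unclear contexts)
    are dropped; items with no valid scores are omitted entirely.
    """
    scores = {}
    for item_id, raw in judgments.items():
        valid = [s for s in raw if 0 <= s <= 10]
        if valid:
            scores[item_id] = sum(valid) / len(valid)
    return scores

# Hypothetical worker judgments for two target phrases:
example = {
    "cloud nine": [0, 1, 0, 66, 2],   # idiomatic; one "unclear" flag
    "red truck": [10, 9, 10, 10],     # literal
}
print(aggregate(example))
```

Averaging over many workers is exactly the "wisdom of the crowds" idea: individual annotators disagree on a subjective task, but the mean score is comparatively stable.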
best,
Chris
On Feb 14, 8:14 am, Siva Reddy <gvs.i...@gmail.com> wrote:
> Dear organizers,
>
> Could you please elaborate on the annotation scheme you followed? On
> what basis did the workers score compositionality on the scale of
> 0-10? What does it mean when a worker assigns a score of, say, 4? How
> do you ensure consistency among the workers?
>
> We are preparing similar data for compositionality in noun-noun
> compounds. It would be very helpful if you could release the
> annotation guidelines.
>
> Regards,
> Siva
>
> > The organizers extracted candidate phrases from two large-scale,
> > freely available web corpora, UkWaC and DeWaC (cf.
> > http://wacky.sslmit.unibo.it/), containing English and German
> > POS-tagged text respectively. These data have been manually
> > evaluated for compositionality with Amazon Mechanical Turk. Workers
> > were presented with a sentence containing a bolded target phrase
> > and were asked to score how literal the phrase was, between 0 and 10.
>
> ==========================================================
> Siva Reddy
> http://www.sivareddy.in