Rejecting HITs


Gabriel Parent

Mar 27, 2010, 3:41:29 PM
to naacl-m...@googlegroups.com
Hi all,

I've used AMT on different projects, most of which didn't involve a lot
of money. I never really felt it was worth spending hours building a
gold standard and compiling worker agreement just to reject some of
their HITs (and save $20?).

Now I'm on a new project: speech labeling and speech transcription.
The transcription will be done on a whole year of dialogs we obtained
through our bus information system; there are around 250K utterances.
That project involves enough money that rejecting work actually makes a
big difference in the total cost of the project.

On the first pass (a labeling task, where workers have to decide
whether a speech utterance is of class A, B, or C), I used the
following techniques:

1) build a gold standard (GS): out of 2,500 utterances annotated by 3
different experts, I kept ~2,100 that we all agreed on
2) insert 1 GS utterance for every 9 utterances given to workers
3) after the work is done, calculate each worker's Kappa, and reject
the work from workers with a Kappa below 0.35 (this included around 10%
of the total number of HITs)

Some workers listened to more than 1,000 utterances but had a Kappa
below 0.2...
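For concreteness, the agreement check against the gold items is just Cohen's Kappa; here's a minimal sketch (the labels below are made up, and this is not my actual script):

```python
from collections import Counter

def cohens_kappa(gold, answers):
    """Cohen's Kappa between a worker's answers and the gold labels.

    gold, answers: equal-length lists of class labels, e.g. "A"/"B"/"C".
    """
    assert len(gold) == len(answers)
    n = len(gold)
    # Observed agreement: fraction of gold items the worker got right.
    p_o = sum(g == a for g, a in zip(gold, answers)) / n
    # Chance agreement: dot product of the two marginal label distributions.
    gold_freq = Counter(gold)
    ans_freq = Counter(answers)
    p_e = sum(gold_freq[c] * ans_freq[c] for c in gold_freq) / n ** 2
    return (p_o - p_e) / (1 - p_e)

# A hypothetical worker who mostly agrees with the gold standard:
gold   = ["A", "B", "C", "A", "B", "C", "A", "B", "C", "A"]
worker = ["A", "B", "C", "A", "B", "B", "A", "B", "C", "A"]
print(round(cohens_kappa(gold, worker), 2))  # → 0.85, well above the 0.35 cutoff
```

A worker answering at chance lands near 0, which is why a Kappa below 0.2 over 1,000+ utterances is so damning.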

The day after that, I got several polite (and some less polite) emails
from workers saying they were sure they had answered all the questions
properly, asking for an explanation of why all of their HITs got
rejected (some didn't ask for explanations; they just basically told me
I should visit Hell some day :P ). I got some bad feedback on
Turkopticon, and earned myself a post on Turker Nation
(http://turkers.proboards.com/index.cgi?board=everyoneelse&action=display&thread=5755&page=1).
I answered all the emails, explaining the evaluation technique and
their Kappas.

Now another pass is coming up, and I'm wondering how I should do it.
This time it's going to be speech transcription, which is a harder task
to accomplish, but also harder to evaluate. What about 1 missing word
out of 10 in a single utterance? What about a worker transcribing
*mumble* or [cough] or something like that (even though I don't ask for
it)? What about a misspelled neighborhood (Mckeesport, McEasport,
Mckeasport)? Is an automatic technique even possible?
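One possibility for a (semi-)automatic check: word error rate against a reference transcript, with bracketed noise stripped out and near-spellings forgiven. A rough sketch; the 0.8 similarity threshold is a guess I haven't validated:

```python
import re
from difflib import SequenceMatcher

def normalize(text):
    """Lowercase and strip side-noise markup like [cough] or *mumble*."""
    text = re.sub(r"\[[^\]]*\]|\*[^*]*\*", " ", text.lower())
    return re.sub(r"\s+", " ", text).strip()

def words_match(a, b, threshold=0.8):
    """Treat near-spellings (Mckeesport / Mckeasport) as the same word."""
    return a == b or SequenceMatcher(None, a, b).ratio() >= threshold

def wer(ref, hyp):
    """Word error rate: word-level Levenshtein distance / reference length."""
    r, h = normalize(ref).split(), normalize(hyp).split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i                      # deleting i reference words
    for j in range(len(h) + 1):
        d[0][j] = j                      # inserting j hypothesis words
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = 0 if words_match(r[i - 1], h[j - 1]) else 1
            d[i][j] = min(d[i - 1][j] + 1,      # deletion
                          d[i][j - 1] + 1,      # insertion
                          d[i - 1][j - 1] + sub)  # substitution / match
    return d[len(r)][len(h)] / max(len(r), 1)
```

Under this scheme, 1 missing word out of 10 scores 0.1, and "Mckeasport" plus a stray [cough] costs nothing, so a worker-level average WER threshold could play the same role Kappa did on the labeling pass.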

I think rejecting work is important for two reasons: it weeds out the
really bad workers/bots, and it lets you spend the money you save on
bonuses for really good workers. (One worker listened to more than
12,000 utterances and had a Kappa over 0.9! He deserved his $30
bonus.)

For the problem of rejecting all of a worker's HITs, two ideas I have
had: 1) reject a certain % of their HITs based on their agreement, or
2) evaluate the data as it comes in, and block workers whose agreement
stays too low (like CrowdFlower does). The second would be a very good
option, but I don't have time to implement it.
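Option 2 could be as simple as a per-worker tally on the sprinkled-in gold items; a minimal sketch (the min_gold and min_accuracy values are invented, not tuned):

```python
class GoldMonitor:
    """Track each worker's running accuracy on gold items as results arrive.

    Flag a worker for blocking once they have answered enough gold items
    and their accuracy sits below the threshold.
    """

    def __init__(self, min_gold=20, min_accuracy=0.7):
        self.min_gold = min_gold
        self.min_accuracy = min_accuracy
        self.stats = {}  # worker_id -> (num_correct, num_seen)

    def record(self, worker_id, is_correct):
        correct, seen = self.stats.get(worker_id, (0, 0))
        self.stats[worker_id] = (correct + int(is_correct), seen + 1)

    def should_block(self, worker_id):
        correct, seen = self.stats.get(worker_id, (0, 0))
        return seen >= self.min_gold and correct / seen < self.min_accuracy

monitor = GoldMonitor()
for i in range(25):                       # a worker who gets 1 gold item in 5 right
    monitor.record("worker_42", i % 5 == 0)
print(monitor.should_block("worker_42"))  # → True
```

Each time an assignment containing a gold item comes back, you'd call record(), and stop serving HITs (or block through the AMT interface) once should_block() fires, so nobody burns hours on work that will be rejected.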

I would like to hear from your experiences. Ideas are also welcome!

Cheers,
Gabriel

Brendan O'Connor

Mar 28, 2010, 8:55:57 PM
to naacl-m...@googlegroups.com
Gabriel -- just one thought,

> 3) after the work is done, calculate Kappa of the workers, and reject the
> work from workers with a Kappa below 0.35 (this included around 10% of the
> total number of HITs)
>
> Some workers listened to more than 1000 utterances, but had a kappa below
> 0.2...

Sounds like one of the issues is that you're rejecting lots of work
from people who got no feedback that they weren't doing the task
right. I assume they can't see their own accuracy rate or Kappa score
while they're in the middle of those 1,000 utterances. If they had no
idea that they were doing poorly, it seems unfair to reject *all* of
their work after the fact. This is what they're saying on the Turker
Nation thread too.

What would be (more) fair is to tell them "you can't work on more of
my HITs in the future, sorry, your work is unacceptable".

Of course, AMT isn't quite set up to handle this use case, since
ideally you don't want to let them do lots of work and then have to pay
them for it (or at least part of it). You could use a qualification
test if a pre-test setup would be sufficient. You mention that
CrowdFlower has the setup you want: constantly sprinkling in gold
items, then warning and blocking workers when they fall below accuracy
thresholds.

This is all for simple multiple choice settings. You mention
trickier, more subjective quality situations too. Manual review might
be the only approach? There are some cases out there of having
Turkers review other Turkers, which seems like it could work well...

The common thread, I think, is that there needs to be more
fine-grained feedback to workers about their quality, instead of one
big review after the worker has done potentially hours of work.

Brendan
--
http://anyall.org

