Another question arises, and that has to do with the way the word
pairs have been presented... it seems like certain function words may
have been omitted. Sometimes this causes no trouble...
reinvent wheel
is pretty clearly "reinvent the wheel"
whereas
name come
has me absolutely flummoxed, particularly since it is shown as being
"medium literal". I can't work out what the original phrase must be,
and I don't think it's just "name come". Other ones that I'm
particularly puzzled by include
run number
name imply
interest lie
All show as medium literal, and I can't really even construct a
sentence using them that doesn't end up sounding really weird...
My name come from a far off land where grammar isn't taught so much...
I run number seven off the field
His name imply ... ?
My interest lie at the bottom of the sea?
What sort of filtering has been done on the pairs we are seeing in the
data (as compared to what appeared in the original context?)
Thanks!
Ted
--
Ted Pedersen
http://www.d.umn.edu/~tpederse
Thanks for these clarifications, this is really helpful, and I think
all my questions are answered at this point!
Cordially,
Ted
Just one rather specific question - how are function words defined? Is
there a list of function words you've used, or is it a set of
particular POS categories? Knowing that would be helpful in enabling
some sort of "look up" from the WaCKy corpus (to find the contexts
where the training examples originated...)
Thanks!
Ted
On Fri, Mar 4, 2011 at 10:32 AM, Organizer DISCo Workshop 2011
<disco201...@gmail.com> wrote:
Thanks once again!
This is just a bit of thinking aloud - I'm a little puzzled as to what
to do with the WaCKy corpus, since it does seem like it will be hard
to locate occurrences of the phrases included in the test data. I've
been trying to do that with the training data and it's a bit of a
puzzle. What I have been attempting is to take that third column in
the part of speech tagged version and use that as my "text" (since
that's the base form that we are getting in the phrases). Then I was
hoping to strip out the intermediate function words and be
left with phrases like "reinvent wheel" that could be lifted fairly
directly from the corpus. Now, what I would do after I actually
accomplished this is another question; mostly this is a "get to know
your data" exercise. Hmmm.
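
For anyone trying the same thing, here is a minimal sketch of the
lemma-column idea described above. It assumes one token per line in the
form word<TAB>tag<TAB>lemma, with sentences closed by a </s> line. The
list of function-word tag prefixes is hypothetical, since the thread does
not say which function words were actually removed, and the sample
sentence is made up.

```python
# Hypothetical function-word tag prefixes; the organizers' real list is unknown.
FUNCTION_TAG_PREFIXES = ("DT", "IN", "TO", "PP", "CC", "MD")


def content_lemmas(lines):
    """Yield one list of content-word lemmas per sentence.

    Assumes word<TAB>tag<TAB>lemma lines and </s> as the sentence end.
    """
    sentence = []
    for line in lines:
        line = line.strip()
        if line == "</s>":
            yield sentence
            sentence = []
            continue
        parts = line.split("\t")
        if len(parts) < 3:
            continue  # skip markup lines such as <s>
        _word, tag, lemma = parts[:3]
        if not tag.startswith(FUNCTION_TAG_PREFIXES):
            sentence.append(lemma)
    if sentence:
        yield sentence  # last sentence if the closing </s> is missing


def contains_pair(lemmas, first, second):
    """True if `first` is directly followed by `second` after filtering."""
    return any(a == first and b == second for a, b in zip(lemmas, lemmas[1:]))


# Made-up sample in the assumed format.
sample = [
    "<s>",
    "They\tPP\tthey",
    "reinvented\tVVD\treinvent",
    "the\tDT\tthe",
    "wheel\tNN\twheel",
    "</s>",
]
sentences = list(content_lemmas(sample))
found = contains_pair(sentences[0], "reinvent", "wheel")
```

This only finds pairs that become adjacent once the assumed function
words are dropped, so anything with intervening content words is missed.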
So...it's good to know that I probably shouldn't expect the above to
work especially well, and that's actually quite helpful (and will save
quite a bit of time...)
Hmmm. :)
Thanks,
Ted
On Wed, Mar 16, 2011 at 11:02 AM, Organizer DISCo Workshop 2011
The patterns you used bring to mind a couple questions that I have
recently encountered.
At least in the English WaCKy corpus, POS tags appear to follow the
usual Penn Treebank standard but there are some exceptions.
Specifically, verb tags may begin with either VB or VV. The tags VB,
VBG, VBP, VBZ, and VBN all occur in the corpus along with VV, VVG,
VVP, VVZ and VVN (and others).
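
In practice, a prefix test covers both families of verb tags. The sketch
below is only an illustration: the helper name is hypothetical, and it
checks just the two prefixes mentioned here, so any other verb tags in
the corpus would need their prefixes added.

```python
# Prefixes taken from the observation above; other verb prefixes may exist.
VERB_PREFIXES = ("VB", "VV")


def is_verb_tag(tag):
    """True if the POS tag starts with one of the known verb prefixes."""
    return tag.startswith(VERB_PREFIXES)


tags = ["VB", "VBG", "VVZ", "VVN", "NN", "JJ"]
verb_tags = [t for t in tags if is_verb_tag(t)]
```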
(1) Is there any documentation about the POS tags found in the WaCKy corpora?
Secondly, the relations in the dependency-parsed version of the
English WaCKy corpus appear to have names similar to
those listed in the CoNLL 2008 challenge, but there are some
discrepancies. I have been unable to find any source describing the
definitions of the relations used by MaltParser (which was used to
generate the dependencies).
(2) Do you know of any resource that defines these dependency relations?
Thanks!
Connie
On Wed, Mar 16, 2011 at 1:02 PM, Organizer DISCo Workshop 2011
Connie,
Regarding your second question, I do not know of any resource that
defines these relations.
thanks,
Chris
> For German tasks:
> Adjective modifiers: ADJ* NN
> Verb-Subject and Verb-object: Extracted VV* - NN* pairs within a window of five words and scanned manually.
I am afraid I am too late to ask this question. I heard German has
relatively free word order compared to English, and I wonder how to
distinguish between verb-subj and verb-obj patterns automatically. Can
you give me some hints? I am using the deWaC corpus. Sorry if my question
is wrong.
Regards,
Siva
--
Siva Reddy http://www.sivareddy.in
In case you need to identify verb-subj vs. verb-obj for the test data: it is relatively safe to say that all occurrences of the noun and the verb in a window of up to 5 words are either unrelated or in the relation as given in the test data.
I hope this helps,
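
A rough sketch of the window check, for illustration only. It assumes the
sentence is already a list of lemmas and reads "a window of up to 5
words" as "at most 5 token positions apart", which is one interpretation
and not confirmed above. The function name and the example sentence are
made up.

```python
def cooccurrences(lemmas, verb, noun, window=5):
    """Return (verb_index, noun_index) pairs at most `window` tokens apart."""
    verb_positions = [i for i, lemma in enumerate(lemmas) if lemma == verb]
    noun_positions = [i for i, lemma in enumerate(lemmas) if lemma == noun]
    return [
        (v, n)
        for v in verb_positions
        for n in noun_positions
        if 0 < abs(v - n) <= window
    ]


# Made-up lemmatized sentence.
lemmas = ["der", "Hund", "haben", "gestern", "die", "Katze", "jagen"]
near = cooccurrences(lemmas, "jagen", "Katze")  # one position apart
far = cooccurrences(lemmas, "jagen", "der")     # six positions apart
```

Each hit would then be labelled with the relation given for that pair in
the test data, following the suggestion above.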