Greetings.
On 12-07-14 06:46 PM, Richard Eckart de Castilho wrote:
> I do not quite understand the problem. In general, I would say that treating
> spaces and underscores in different ways is very reasonable. Why would you
> need to preprocess a corpus for use with Uby?
In general, yes, treating underscores and spaces differently is
reasonable. But if you are writing an interface to resources which do
not themselves make such a distinction, then it is probably unreasonable
for the interface to do so.
Of course, I don't know whether all the resources to which Uby provides
an interface distinguish between spaces and underscores; I just know
that three of them don't. Perhaps the others do, and so preserving the
distinction generally is important. This is why I've started this
thread: to find out whether there is some valid reason for Uby's
behaviour. :)
With respect to the corpora, most or all of the WSD corpora and other
data sets I've worked with, when they need to write multi-word lemmas
and expressions which correspond to a Wikipedia article title or WordNet
synonym, use underscores rather than spaces. (There are many good
arguments for doing it this way, probably the most important of which is
that it eliminates the problem of line breaks within identifiers in
marked-up text.) It's convenient to pass these identifiers as-is to the
relevant API rather than having to first replace the underscores with
spaces. Certainly it's not an insurmountable problem; it's just a bit
of a surprise to have to deal with it after 10+ years of using other
WordNet and Wikipedia interfaces which happily treat the two characters
as equivalent.
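
For illustration, the workaround on the caller's side is small. The following is only a minimal sketch in Python, not anything taken from Uby: the names normalize_lemma and lookup are hypothetical, and a plain dict keyed by space-separated lemmas stands in for whatever lexicon interface is actually being queried.

```python
def normalize_lemma(identifier: str) -> str:
    """Treat underscores and spaces as equivalent.

    Underscores become spaces, and runs of whitespace are collapsed
    so that "New_York" and "New York" yield the same key.
    """
    return " ".join(identifier.replace("_", " ").split())


def lookup(lexicon: dict, identifier: str):
    """Look up an identifier in a hypothetical lexicon (a plain dict
    keyed by space-separated lemmas), normalizing it first."""
    return lexicon.get(normalize_lemma(identifier))


# Hypothetical stand-in data, not real resource content.
toy_lexicon = {"ice cream": "entry-1", "New York": "entry-2"}
result_underscore = lookup(toy_lexicon, "ice_cream")
result_space = lookup(toy_lexicon, "ice cream")
```

With this wrapper, corpus identifiers written with underscores and those written with spaces resolve to the same entry, which is the behaviour I had come to expect from other interfaces.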