Equivalence of space and underscore

Tristan Miller

Jul 13, 2012, 12:21:58 PM

to uby-...@googlegroups.com

Greetings.

One thing I have noticed about Uby, or at least its getSenses() method,
is that it treats spaces and underscores as distinct characters, even if
the underlying LSR does not.

For example, Wiktionary, Wikipedia, and WordNet all treat spaces and
underscores as equivalent. In their "official" or most popular
interfaces (the Wiktionary and Wikipedia websites, and the JWNL API,
respectively), searches for "ice cream" and "ice_cream" both return the
same entry. Also, in my experience, electronic data which references
Wikipedia article titles or WordNet senses tends to favour the
underscore when writing multi-word expressions.

I was wondering whether it was a conscious design decision for Uby to
treat the two characters distinctly, and if so, what the reasons for it
were. If not, perhaps you would consider revising the API so that calls
to getSenses() and other methods which accept a word form do not
distinguish between spaces and underscores. This would bring their
behaviour into line with at least one other popular LSR's API, and would
obviate the need to preprocess tagged corpora for use with Uby.

Regards,
Tristan

--
Tristan Miller
Ubiquitous Knowledge Processing Lab
Department of Computer Science, Technische Universität Darmstadt
Tel: +49 6151 16 6166 | Web: http://www.ukp.tu-darmstadt.de/

Judith Eckle-Kohler

Jul 13, 2012, 4:01:34 PM

to UBY Users

Thanks for reporting this API behavior. We will look into that.

--- Judith

Richard Eckart de Castilho

Jul 14, 2012, 12:46:56 PM

to uby-...@googlegroups.com

Hi,

I do not quite understand the problem. In general, I would say that treating
spaces and underscores in different ways is very reasonable. Why would you
need to preprocess a corpus for use with Uby?

Best,

-- Richard

Tristan Miller

Jul 16, 2012, 4:33:20 AM

to uby-...@googlegroups.com

Greetings.

On 12-07-14 06:46 PM, Richard Eckart de Castilho wrote:
> I do not quite understand the problem. In general, I would say that treating
> spaces and underscores in different ways is very reasonable. Why would you
> need to preprocess a corpus for use with Uby?

In general, yes, treating underscores and spaces differently is
reasonable. But if you are writing an interface to resources which do
not themselves make such a distinction, then it is probably unreasonable
for the interface to do so.

Of course, I don't know whether all of the resources to which Uby
provides an interface distinguish between spaces and underscores; I just
know that three of them don't. Perhaps the others do, and so preserving
the distinction generally is important. This is why I've started this
thread: to find out whether there is some valid reason for Uby's
behaviour. :)

With respect to the corpora, most or all of the WSD corpora and other
data sets I've worked with, when they need to write multi-word lemmas
and expressions which correspond to a Wikipedia article title or WordNet
synonym, use underscores rather than spaces. (There are many good
arguments for doing it this way, probably the most important of which is
that it eliminates the problem of line breaks within identifiers in
marked-up text.) It's convenient to pass these identifiers as-is to the
relevant API rather than having to first replace the underscores with
spaces. Certainly it's not an insurmountable problem; it's just a bit
of a surprise to have to deal with it after 10+ years of using other
WordNet and Wikipedia interfaces which happily treat the two characters
as equivalent.
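
The preprocessing step described above can be sketched as follows. This is only a minimal illustration, not part of Uby: the class name LemmaNormalizer and the caller-supplied lookupFn are hypothetical, and lookupFn stands in for whatever method accepts a word form (for example, a wrapper around getSenses()).

```java
import java.util.List;
import java.util.function.Function;

/** Hypothetical caller-side helper; not part of the Uby API. */
public final class LemmaNormalizer {

    private LemmaNormalizer() {
        // static utility class, no instances
    }

    /** Replaces every underscore with a space, e.g. "ice_cream" becomes "ice cream". */
    public static String underscoresToSpaces(String lemma) {
        return lemma.replace('_', ' ');
    }

    /**
     * Normalizes the lemma and passes it to a caller-supplied lookup function.
     * The lookup function is an assumption standing in for the real API call.
     */
    public static <T> List<T> lookup(String lemma, Function<String, List<T>> lookupFn) {
        return lookupFn.apply(underscoresToSpaces(lemma));
    }
}
```

With such a wrapper, identifiers taken from a tagged corpus could be passed through unchanged, at the cost of one extra call per lookup.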

Judith Eckle-Kohler

Jul 26, 2012, 3:30:39 PM

to uby-...@googlegroups.com

Hi Tristan,

UBY will not follow the WordNet convention of encoding spaces in MWEs as underscores.

There are occurrences of the underscore in (evolving) natural language and in machine-readable dictionaries, such as Wiktionary, that cannot be replaced by a space without changing the meaning. As UBY aims to be a lexical resource with particularly large coverage, we want to keep the distinction between space and underscore.

Consider the following examples:

1) There is a Wiktionary page for the underscore:
http://en.wiktionary.org/wiki/Unsupported_titles/Low_line

2) This Wiktionary page describes a special usage of the underscore in internet slang:
"Sometimes used to indicate the start and end of portions of plain text that would be underlined if formatting was available."

That's _really_ amazing!

3) The underscore can be part of an emoticon; see, e.g.,

http://en.wiktionary.org/wiki/Appendix:Emoticons - for instance: ^_- winking smile

Emoticons can be considered as part of newly evolving languages and UBY should be able to represent such lexical items, too.
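
The examples above show why a blind replacement is lossy: it would turn "^_-" into "^ -". One way a caller might reconcile both needs is to try the exact form first and fall back to the space variant only when nothing is found. The sketch below is a hypothetical caller-side helper, not UBY behaviour; the class name ExactFirstLookup and the lookupFn parameter are assumptions standing in for a real lookup call.

```java
import java.util.List;
import java.util.function.Function;

/** Hypothetical caller-side helper; not part of the UBY API. */
public final class ExactFirstLookup {

    private ExactFirstLookup() {
        // static utility class, no instances
    }

    /**
     * Looks up the form exactly as given. Only if that yields no results and
     * the form contains an underscore is the lookup retried with underscores
     * replaced by spaces. Entries such as "^_-" are therefore still found
     * under their literal spelling.
     */
    public static <T> List<T> lookup(String form, Function<String, List<T>> lookupFn) {
        List<T> exact = lookupFn.apply(form);
        if (!exact.isEmpty() || form.indexOf('_') < 0) {
            return exact;
        }
        return lookupFn.apply(form.replace('_', ' '));
    }
}
```

This keeps the distinction between space and underscore in the resource itself and leaves the equivalence as an opt-in convenience on the caller's side.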

Best

--- Judith
