Non-localized punctuation handling?

Steve Richfie1d

Nov 6, 2005, 1:21:53 PM

I'm working on getting an AI application running WITHOUT specifying the
input language - if it recognizes the words in whatever language they
were written in, then who cares what the language was? This is to solve
the mixed-language and foreign-terms problems. This is not without its
problems, the worst of which (so far) seems to be in handling the
various forms of conflicting punctuation in the various languages.

All I need do for initial parsing is to properly recognize what is a
word, the value of numbers, etc., for subsequent recognition and processing.

For example, in English we use the "-" character for many things, e.g. a
unary minus sign, a binary minus sign, a dash (there are separate dash
characters outside ASCII that no one seems to ever use), an end-of-line
hyphen, a middle-of-word hyphen, etc. As far as I know, German uses few
middle-of-word hyphens (it uses run-on words instead), so applying
English logic causes no problems there: the extra English cases simply
don't appear in the other language's text, so there is no conflict here.

However, numbers are a bigger problem, as 12.345 is only 1/1000th as big
in English as it is in other European languages. Further, one billion is
only one thousandth as large in American English as it is in British English.

I can handle the ./, problem by recognizing English, probably easiest
done by looking for the presence of auxiliary verbs (are, do, be, have,
shall, may, can, must, and their variations). If not English, I don't
need to know what language it is because no other language besides
English (that I know of) inverts the usage of commas and decimal points
in numbers.
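
To make the idea concrete, here is a rough Python sketch of what I have in mind. The word list, the 5% threshold, and the function names are all my own assumptions for illustration, not tested values.

```python
import re

# Auxiliary verbs named above plus a few variations. The exact list and
# the threshold below are assumptions, not tuned values.
AUXILIARIES = {
    "are", "is", "am", "was", "were", "be", "been", "being",
    "do", "does", "did", "have", "has", "had",
    "shall", "should", "may", "might", "can", "could", "must",
}

def looks_english(text, threshold=0.05):
    """Return True if auxiliaries make up at least `threshold` of the words."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return False
    hits = sum(1 for w in words if w in AUXILIARIES)
    return hits / len(words) >= threshold

def parse_number(token, english):
    """Read a numeric token under one of the two conventions discussed."""
    decimal_sep, group_sep = (".", ",") if english else (",", ".")
    # Drop grouping separators first, then turn the decimal separator
    # into the "." that float() expects.
    cleaned = token.replace(group_sep, "").replace(decimal_sep, ".")
    return float(cleaned)

print(parse_number("12.345", english=True))   # read as twelve and a bit
print(parse_number("12.345", english=False))  # read as twelve thousand
```

A short text with few auxiliaries would be misclassified, so the threshold would need checking against real input.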

Am I just seeing the tip of the iceberg here or is this the entire
problem? How do Eastern languages (not a problem now, but I see it
coming in the future) do their punctuating? What ELSE might I need to
concern myself with? Is there some reference or web site that discusses
these issues?

Thanks in advance for your time and help.

Steve Richfie1d

Beth N

Nov 8, 2005, 7:39:23 PM

I would say you are just seeing the tip of the iceberg, judging by this
statement:

> If not English, I don't need to know what language it is because no
> other language besides English (that I know of) inverts the usage of
> commas and decimal points in numbers.

Japanese in Japan would be one example of a locale using a period as the
decimal point and a comma as the grouping separator. Another would be
Hindi in India (I could go on).

And when you say "inverts the usage of commas and decimal points in
numbers" you seem to assume that commas and periods are the only
choices. Non-breaking space and the apostrophe (') would be other
possibilities you would have to take into account.
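
To illustrate why the separators alone don't settle it, here is a small Python sketch that tries several conventions against one token and reports every reading that fits. The `CONVENTIONS` table is a made-up, incomplete list for illustration, not a survey of real locales.

```python
# (decimal separator, grouping separator) pairs. An illustrative and
# incomplete list; the names are descriptive labels, not locale codes.
CONVENTIONS = {
    "period decimal, comma groups": (".", ","),
    "comma decimal, period groups": (",", "."),
    "comma decimal, no-break space groups": (",", "\u00a0"),
    "period decimal, apostrophe groups": (".", "'"),
}

def _plain_digits(s):
    """True only for a non-empty run of ASCII digits."""
    return s.isascii() and s.isdigit()

def interpretations(token):
    """Return {convention name: value} for each convention that fits token."""
    results = {}
    for name, (decimal_sep, group_sep) in CONVENTIONS.items():
        cleaned = token.replace(group_sep, "")
        integer_part, _, fraction = cleaned.partition(decimal_sep)
        # Any leftover non-digit means this convention cannot explain the token.
        if not _plain_digits(integer_part):
            continue
        if fraction and not _plain_digits(fraction):
            continue
        results[name] = float(integer_part + "." + (fraction or "0"))
    return results

for name, value in interpretations("12.345").items():
    print(name, value)
```

A token like "12.345" fits more than one convention with different values, while "12'345.6" fits only one, so the ambiguity has to be resolved from context rather than from the token.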

Steve Richfie1d

Nov 10, 2005, 1:28:08 AM

Beth

Thanks for your comments.

I've also been discussing this with the folks on
<news:comp.ai.nat-lang>. What appears to be the answer is to have a
bunch of fields in the table of languages where the Unicode characters
for each of the many functions in each of the languages are kept. This
would include SEPARATE characters even where they do the same thing in
English, e.g. unary minus, binary minus, hyphen, dash, and end-of-line
continuation would all be filed separately because they could be
different in other languages. This would seem to solve the differing
punctuation problem, PRESUMING that other languages don't have
completely different things that they are punctuating.
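
Roughly, I picture the table looking something like this Python sketch. The field names, the two example rows, and their keys ("en-typewriter", "en-typeset") are placeholders I made up to show the shape, not real data for any language.

```python
from dataclasses import dataclass, fields

@dataclass(frozen=True)
class PunctuationRoles:
    """One row of the language table: which character fills each role."""
    unary_minus: str
    binary_minus: str
    hyphen: str
    dash: str
    line_continuation: str
    decimal_sep: str
    group_sep: str

# Hypothetical rows. The first reuses "-" for everything, the second
# uses the distinct Unicode characters for each role.
LANGUAGES = {
    "en-typewriter": PunctuationRoles("-", "-", "-", "-", "-", ".", ","),
    "en-typeset": PunctuationRoles(
        "\u2212",  # MINUS SIGN
        "\u2212",  # MINUS SIGN
        "\u2010",  # HYPHEN
        "\u2014",  # EM DASH
        "\u2010",  # HYPHEN
        ".",
        ",",
    ),
}

def roles_of(char, language):
    """Return the names of every role that `char` can play in `language`."""
    entry = LANGUAGES[language]
    return [f.name for f in fields(entry) if getattr(entry, f.name) == char]

print(roles_of("-", "en-typewriter"))
print(roles_of("\u2014", "en-typeset"))
```

When one character maps to several roles, as "-" does in the first row, the parser still has to pick among them from context. The table only narrows the candidates.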

HOWEVER, some languages use different characters for digits, e.g. Arabic
2's and 3's look like they fell over onto their face, so numbers appear
to be a special problem in some languages. I can see some possible
solutions for this, but none that I really like. I wonder what Visual
Basic functions like Val and CDbl do with Arabic and other
representations of numbers - do these functions work on such Unicode
characters?
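
I can't say what the Visual Basic functions do, but as a point of comparison here is a small Python sketch that folds any Unicode decimal digit to its ASCII equivalent before parsing. The function name is my own. It relies only on the standard `unicodedata` module.

```python
import unicodedata

def to_ascii_digits(text):
    """Replace every Unicode decimal digit (category Nd) with 0-9."""
    out = []
    for ch in text:
        if unicodedata.category(ch) == "Nd":
            # unicodedata.decimal gives the digit's numeric value as an int.
            out.append(str(unicodedata.decimal(ch)))
        else:
            out.append(ch)
    return "".join(out)

arabic_indic = "\u0662\u0663"       # ARABIC-INDIC DIGIT TWO, THREE
devanagari = "\u0967\u0968\u0969"   # DEVANAGARI DIGIT ONE, TWO, THREE

print(to_ascii_digits(arabic_indic))
print(to_ascii_digits(devanagari))
print(int(arabic_indic))  # Python's int() accepts these digits directly
```

This only handles the digits themselves. Separators that differ by script would still need the per-language table.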

Also, many authors use the WRONG characters that look like the right
characters for things, which can look OK to people but which causes LOTS
of problems for AI software. For example, Japanese has a period that
looks like a degree sign but down at a lower level, which can be
simulated with an "o" or just a period. Similarly, an Arabic zero is
just a dot but at a higher level, which people might use a period for.
This may actually be a rare case where code pages actually help, because
they map like-looking characters into the SAME character that is
appropriate for the language being used. Unfortunately, code pages have
SO many problems, especially in multi-lingual settings.
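
One possible substitute for the code-page trick, sketched in Python: apply Unicode NFKC normalization (which folds fullwidth forms to their ordinary equivalents) and then a small hand-made table for look-alikes that normalization leaves alone. The `EXTRA_FOLDS` table and its single entry are my own assumption for illustration.

```python
import unicodedata

# Hypothetical hand-made table of look-alikes that NFKC does not fold.
EXTRA_FOLDS = {
    "\u3002": ".",  # IDEOGRAPHIC FULL STOP -> period
}

def fold_lookalikes(text):
    """Fold compatibility forms, then apply the hand-made table."""
    normalized = unicodedata.normalize("NFKC", text)
    return "".join(EXTRA_FOLDS.get(ch, ch) for ch in normalized)

print(fold_lookalikes("\uff11\uff12\uff0e"))  # fullwidth "1", "2", full stop
print(fold_lookalikes("\u3002"))
```

This goes only one way, toward a common form. It cannot tell when a plain period was typed in place of an Arabic zero, since that depends on context, and folding discards the original character, so keeping the raw text alongside seems wise.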

Another problem seems to be where people erroneously use characters that
are designed to go in a different direction. I don't know how common
this is.
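
If it helps, one way to at least detect such cases might be to check each character's Unicode bidirectional class. This Python sketch flags a token that mixes strongly left-to-right and strongly right-to-left characters. Treating that mix as suspicious is my own assumption, since legitimate mixed text exists too.

```python
import unicodedata

def mixed_direction(token):
    """True if token has both strong LTR and strong RTL characters."""
    classes = {unicodedata.bidirectional(ch) for ch in token}
    has_ltr = "L" in classes
    has_rtl = bool(classes & {"R", "AL"})  # Hebrew-style and Arabic letters
    return has_ltr and has_rtl

print(mixed_direction("abc"))
print(mixed_direction("ab\u05d0"))  # Latin letters plus HEBREW LETTER ALEF
```

Digits and most punctuation carry weak or neutral classes, so they do not trigger the flag on their own.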

I wonder what OTHER such special problems are waiting for me?!

Thanks again for your help.

Steve Richfie1d
================