My years of interaction on this newsgroup have paid off. I have developed
a new theory of language that makes it possible to use computers as
linguistic machines. You can find out more about this on
http://panlingua.net, where various text files and audio lectures on the
subject can be found, along with examples of interactive software running
in English, Hawaiian, Buru Language, Ambon Malay, Indonesian, and
Vietnamese.
But what I really want to announce in this message is my new web site,
http://witchit.com.
Witchit is a new and innovative kind of web search engine, and I need to
get a lot of people on board to make it work. The basic idea is something
like Wikipedia, but this system operates entirely on natural language, and
I believe it is a prototype, or maybe a seed, of the search engine of the
future.
All of the tables and other data collected in this project will be
available for correction and for use as-is by any member of the public.
Upgrades and corrections will be vetted before being used to replace old
information, but anyone will be free to send these in.
We will also provide search engines free of charge for new kinds of
information provided by the public in usable form. For example, suppose
you have a table related to some field of science. We will provide the
software to link this information into our main system and make it
available in natural language (plain English for now).
Here are some of the kinds of things Witchit can already do right now as
explained in response to a "What can you do?" query:
I have only begun to pluck the low-hanging fruit from the vast tree of
knowledge, but already I can do a lot.
Before I say anything else, if you ask me something and I don't know the
answer, and you think I ought to know that answer, just tell it to me by
typing it in, and I will remember it in a day or two. For example:
John Doe was born on may 20, 1985.
With this information, I will later be able to answer questions like the
following:
When was John Doe born?
What is John Doe's birthday?
How old is John Doe?
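As a rough illustration of how one stored birth date can answer all three questions, here is a minimal Python sketch. The `facts` dictionary and the function names are hypothetical and say nothing about how Witchit actually stores what it is told.

```python
from datetime import date

# Hypothetical store of typed-in facts: name -> birth date.
facts = {"John Doe": date(1985, 5, 20)}

def age_on(birth: date, today: date) -> int:
    """Whole years elapsed between birth and today."""
    years = today.year - birth.year
    # Subtract one if this year's birthday has not happened yet.
    if (today.month, today.day) < (birth.month, birth.day):
        years -= 1
    return years

def birthday(name: str) -> tuple:
    """(month, day) of the person's birthday."""
    born = facts[name]
    return (born.month, born.day)

def how_old(name: str, today: date) -> int:
    return age_on(facts[name], today)
```

Passing the current date in explicitly keeps the age calculation easy to check by hand.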
Remember, your participation in this project is important to me and to all
the people who bother me with questions!
Here are some example questions and commands I can respond to now:
I can tell you when to catch the bus:
When does the bus leave from Ocean View for the University of Hawaii?
If your favorite bus schedule is not available to Witchit, just e-mail it
to Chaumont Devin, and he will make sure it gets included. You can also
just tell it to me as follows--type something like:
Starting terminus 9:00 am, Stop #1 9:05 am, Stop #2 9:15 am, ...
If I tell you I don't understand this, just ignore me. I will be able to
answer questions about your bus route in a day or two.
I can find you a recipe:
Find me a soup recipe.
Show me a recipe for salad.
Get me a salad recipe.
I can perform mathematical calculations:
What is 5280 * .3048?
What is atn(60)?
I recognize the following mathematical functions:
sqr() = square root.
log() = natural log.
sin() = sine.
asn() = arc sine.
cos() = cosine.
acs() = arc cosine.
tan() = tangent.
atn() = arc tangent.
(Arc values expressed in degrees.)
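For readers who want to check answers by hand, the function list can be mirrored in a few lines of Python. This is only a sketch: the mapping onto Python's `math` module is my assumption, and so is the choice to let sin, cos and tan take their arguments in degrees (the list only says that arc values are in degrees).

```python
import math

# Names follow the list above; the Python mapping is an assumption.
FUNCTIONS = {
    "sqr": math.sqrt,                              # square root
    "log": math.log,                               # natural log
    "sin": lambda d: math.sin(math.radians(d)),    # assumes degrees in
    "cos": lambda d: math.cos(math.radians(d)),
    "tan": lambda d: math.tan(math.radians(d)),
    "asn": lambda x: math.degrees(math.asin(x)),   # arc values out in degrees
    "acs": lambda x: math.degrees(math.acos(x)),
    "atn": lambda x: math.degrees(math.atan(x)),
}

feet_per_mile_in_meters = 5280 * 0.3048   # a mile in meters, about 1609.344
arc_tangent_of_60 = FUNCTIONS["atn"](60)  # a little over 89 degrees
```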
I can tell you about books:
What do you have by Jack London?
Find me Alice In Wonderland.
Find Alice In Wonderland.
Where is Alice In Wonderland?
(Make sure every word in the name is capitalized.)
I can tell you something stupid if you type:
Tell me something stupid.
I can tell you what something is:
What is a brontosaur?
I can tell you what something means:
What does pluperfect mean?
What does to metamorphose mean?
I can tell you what happened during a certain year:
What happened in 952 bc?
Or a range of years:
What happened between 1942 and 1945?
I can tell you a birthday:
What is Sarah Palin's birthday?
(You may have to put a backslash before the apostrophe on some systems.)
I can tell you when someone was born:
When was Sarah Palin born?
And I can (shudder) tell you how old somebody is:
How old is Sarah Palin?
I can give you a word of wisdom if you type:
Tell me something.
Tell me anything.
I can give you the time of day in various places if you type, for example:
What time is it in Tokyo?
Tell me the time in London.
I can provide various facts about population and geography. Ask me:
What is the population of Argentina?
What is the area of Mexico?
What is the population density of India?
What is the literacy rate in Paraguay?
Tell me about the currency of Honduras.
Tell me about the coastline of Vietnam.
What is the highest elevation in Sudan?
What is the capital of Panama?
Who is the president of Nicaragua?
What is the gdp of Japan?
Tell me about the economy of Singapore.
I can convert between various kinds of measurement if you ask me things
like:
How many inches in a meter?
How many cubic centimeters in a gallon?
I can tell you the approximate weight of a given volume of water:
How many grams in a milliliter?
How many pounds in a gallon?
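Conversions like these come down to dividing one unit's size by another's in a common base unit. A minimal sketch follows; the table layout and names are hypothetical, the gallon is taken to be the US gallon, and water is taken as 1 gram per milliliter, which is only approximate.

```python
# Lengths in meters and volumes in cubic meters (exact by definition).
METERS = {"inch": 0.0254, "foot": 0.3048, "meter": 1.0, "mile": 1609.344}
CUBIC_METERS = {
    "cubic centimeter": 1e-6,
    "milliliter": 1e-6,
    "liter": 1e-3,
    "gallon": 3.785411784e-3,   # US gallon (assumption)
}

GRAMS_PER_POUND = 453.59237
WATER_GRAMS_PER_ML = 1.0        # approximate density of water

def how_many(small: str, big: str, table: dict) -> float:
    """How many `small` units fit in one `big` unit."""
    return table[big] / table[small]

def water_pounds_in(volume_unit: str) -> float:
    """Approximate weight in pounds of one `volume_unit` of water."""
    milliliters = how_many("milliliter", volume_unit, CUBIC_METERS)
    return milliliters * WATER_GRAMS_PER_ML / GRAMS_PER_POUND
```

With these figures a meter is about 39.37 inches and a US gallon of water weighs roughly 8.3 pounds.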
I can give you geographical coordinates and great circle distances and
directions in response to questions like:
How far is it from Hong Kong to Singapore?
What is the latitude of Anchorage?
What are the coordinates of Honolulu?
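A great circle distance and direction can be computed from two pairs of coordinates with the standard haversine and initial-bearing formulas. The sketch below assumes a spherical Earth of radius 6371 km and uses approximate city coordinates that I supplied myself; it is not taken from Witchit's tables or code.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean radius; a spherical Earth is assumed

def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine distance between two points given in degrees."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(p1) * math.cos(p2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def initial_bearing_deg(lat1, lon1, lat2, lon2):
    """Compass heading at the start of the great circle, 0 = north."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dlam = math.radians(lon2 - lon1)
    y = math.sin(dlam) * math.cos(p2)
    x = (math.cos(p1) * math.sin(p2)
         - math.sin(p1) * math.cos(p2) * math.cos(dlam))
    return math.degrees(math.atan2(y, x)) % 360

# Approximate coordinates (latitude, longitude), my own figures.
HONG_KONG = (22.32, 114.17)
SINGAPORE = (1.35, 103.82)

distance = great_circle_km(*HONG_KONG, *SINGAPORE)
heading = initial_bearing_deg(*HONG_KONG, *SINGAPORE)
```

For Hong Kong to Singapore this gives somewhat under 2,600 km on a heading of roughly south-southwest.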
I can tell you where places are:
Where is Ankara?
I can give you geographical coordinates:
What are the coordinates of Ankara?
I can give you latitude only:
What is the latitude of Ankara?
Or longitude:
What is the longitude of Ankara?
Or elevation:
What is the height of Kilimanjaro?
I have over 20,000 place-name entries for the world, and I can give you
similar information for more than two million locations in the US.
I want your feedback. I can convey any one-line message or remark or fact
you type to my creator.
Would you like to get involved in the Witchit project? Right now we
especially need algorithms and tables, but Witchit also accepts single
sentences of information. To pass a single sentence of information to
Witchit, just bring up Witchit.com and type your statement instead of a
question.
You can help us to maintain our internal tables. To locate any internal
table, just click the "source" button after my response, and you will see
the URL for the table if it is one of mine, or else the external source
from which the information was gathered.
Send your results to Chaumont Devin:
contact: (change the '`' to '@') joedevin`witchit.com
Probieren geht ueber studieren. (proving goes above studying)
Could your system translate
http://docs.google.com/Doc?docid=0AQIg8QuzTONQZGZxenF2NnNfNzY4ZDRxcnJ0aHI&hl=en_GB
You will find a short passage in technical Arabic. Could your system
provide a better translation?
Could I ask the following questions?
What is the density of the material in a White Dwarf?
What is the luminosity of a blue giant with 20 times the mass of the
Sun?
What happens when hydrogen is exhausted?
Is there Plutonium in the Crab? (I expect context to indicate the Crab
Nebula)
The answers to all these questions are either contained in Wikipedia
or deducible from Wiki.
As you can see I am a little bit sceptical. What you are doing appears
very similar to what CYC is doing. Are you acquainted with either CYC,
or the proof engines contained in Alcor/Mizar?
My Arabic passage contains a number of key words. Translated into
English these are.
Main Sequence, Blue Giant, Red Dwarf, Red Giant, Supergiant, White
Dwarf, Stefan Boltzmann law, Surface area of sphere
Can you spot these words in context? You need disambiguation too. On a
sign at Jeddah airport there is a bilingual sign - Ground
Transportation (AlArD - note capitals are part of transliteration).
Transportation you associate with the ground, the size of white dwarfs
with the Earth. If I tell you they are the mass of the Sun you should be
able to work out the density and surface gravity. All is in Wiki.
We normally differentiate meanings like that using LSA. You appear not
to have used it.
One last point: Google has its finger in a large number of pies. It
owns YouTube. It has got a merchant facility for on line credit card
transactions. It has Google maps, Google this Google that. Google is
going to digitize all books. Clearly it has the clout. If you are
going to start up a search engine you have to prove not only that it
is a bit better than Google, but a lot better. Your best bet therefore
might well be to approach Google or Microsoft. They will I can assure
you be asking the same sort of questions I am asking.
Sorry to be so negative.
- Ian Parker
Dear Ian Parker,
You have a right to be skeptical, and I am happy to allow you to test the
validity of my work.
You write:
>
> Probieren geht ueber studieren.(proving goes above studying)
>
I agree.
> Could your system translate
>
>http://docs.google.com/Doc?docid=0AQIg8QuzTONQZGZxenF2NnNfNzY4ZDRxcnJ0aHI&hl=en_GB
>
> You will find a short passage in technical Arabic. Could your system
> provide a better translation?
>
I cannot answer this because so far I have never applied this system to
Arabic. But in general good machine translation is possible based on the
fact that my systems do not just manipulate words, but actually
"understand" text, where "understanding" is defined as disambiguating each
word. In all natural languages I know of, most words can have several
meanings, and can fit together with other words in the same sentence in
several ways. To understand, the machine needs to be able to (1) determine
which of several possible meanings is intended, and (2) determine the
syntactic "regent" for each word, such a regent being some other word
within the same sentence. For example, in the phrase, "The big dog," "dog"
is the regent of both "the" and "big," and these words are called the
"dependents" of "dog."
> Could I ask the following questions?
>
> What is the density of the material in a White Dwarf?
>
The system can be told these things by a tutor, after which it will be able
to answer them without any problem. By "told," I mean that the tutor, or
"user" will have to type each fact in and make sure his/her input sentences
are parsed correctly.
> What is the luminosity of a blue giant with 20 times the mass of the
> Sun?
>
This kind of information can best be set up in tabular form. The system
then parses the sentence and looks for the answer in a table (a file on
disk). The basic strength of the system is its ability to parse (reliable
parsing has long been the holy grail of AI research). Because the system
can easily be taught to parse any new kind of sentence, it remains for some
programmer to write an "event handler" that knows how to look up the
answers to a particular kind of query in a table or to determine the answer
by some algorithm, or to look the answer up on the web. Each such
capability (the ability to answer some particular kind of query) can take
from a few minutes to several days to set up, and this is why I am unable
to quickly build a system capable of answering just any question in the
world. It takes a good deal of time and effort even to teach our brightest
children to find the answers to many questions, which is a lot of what good
education is about. At present, it takes at least as long or longer to
equip machines with the capabilities to do the same things, but it is
perfectly doable if only we have a machine that can really parse our
natural-language input, and this is what my work has provided.
> What happens when hydrogen is exhausted?
>
It may be difficult to make a machine capable of discerning what domain of
discourse is intended with a question like this. My system is able to
handle this kind of question because the user can set up a cyclopedic
reference file for some special domain of discourse, such as astrophysics,
and invoke this file (with the desired domain of discourse) to replace the
cyclopedic reference file that was being used before. However I have not
set up a cyclopedic reference for astrophysics because this would take a
lot of time and effort, and I have had to focus upon other projects. If
you would be interested in equipping Witchit to handle such queries, then I
would be happy to step through the process with you, and answer your
questions along the way.
> Is there Plutonium in the Crab? (I expect context to indicate the Crab
> Nebula)
>
Can't do it without emphasizing some domain of discourse. There might be
plutonium in the crab meat somebody is cooking.
> The answers to all these questions are either contained in Wikipedia
> or deducible from Wiki.
>
The problem is that Wikipedia was developed without a system such as
Witchit in mind, which makes it impossible to mine such information from
Wikipedia without hundreds of hours of hard work. My hope is that in the
future people will come to understand the theory that I have provided and
then integrate their work with mine, or just start using my system and take
over.
> As you can see I am a little bit sceptical. What you are doing appears
> very similar to what CYC is doing. Are you acquainted with either CYC,
I made the acquaintance of Doug Lenat back when Cyc was in its infancy, but
at that time he was not interested in collaborating with me. Since then,
both Cyc and my own efforts have explored a few dead ends and snags, but I
believe that my system has won the competition because of its innate
simplicity, rigor, and general theoretical soundness.
> or the proof engines contained in Alcor/Mizar
>
I have never heard of these.
> My Arabic passage contains a number of key words. Translated into
> English these are.
>
> Main Sequence, Blue Giant, Red Dwarf, Red Giant, Supergiant, White
> Dwarf, Stefan Boltzmann law, Surface area of sphere
>
> Can you spot these words in context.
Surely.
> You need disambiguation too.
Every word of every sentence that you pass to Witchit is disambiguated
during the parsing phase as follows:
Theorem: Every word of every coherent human sentence ever spoken or written
is simply a linguistic node from which emanate two linguistic links.
(Please listen to my lectures on http://panlingua.net for details).
The first link is to the meaning of the word, which must usually be
selected from among several. The second link is to the regent of the word,
which again must usually be selected from among several.
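To make the two-link idea concrete, here is a minimal Python sketch of such a node, using the phrase "The big dog" from earlier in this message. The class, the field names and the candidate meanings are invented for illustration and are not taken from the Panlingua software.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Node:
    word: str
    meanings: List[str]            # candidates for the first link
    regents: List[Optional[int]]   # candidate regents by position; None = no regent
    meaning: Optional[str] = None  # chosen meaning
    regent: Optional[int] = None   # chosen regent (None for the head of the phrase)

def readings(sentence: List[Node]) -> int:
    """Number of meaning/regent combinations before disambiguation."""
    total = 1
    for node in sentence:
        total *= len(node.meanings) * len(node.regents)
    return total

def dependents(sentence: List[Node], head: int) -> List[str]:
    """Words whose chosen regent is the node at position `head`."""
    return [node.word for node in sentence if node.regent == head]

# "The big dog" after disambiguation; candidate lists are made up.
sentence = [
    Node("The", ["definite article"], [2], "definite article", 2),
    Node("big", ["large", "important"], [2], "large", 2),
    Node("dog", ["canine", "to follow"], [None], "canine", None),
]
```

Even this three-word phrase has four possible readings with the made-up candidates, which is why both links have to be chosen for every word.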
> On a
> sign at Jeddah airport there is a bilingual sign - Ground
> Transportation (AlArD - note capitals are part of transliteration).
> Transportation you associate with the ground. size of white dwarfs with
> the Earth. If I tell you they are the mass of the Sun you should be
> able to work out the density and surface gravity. All is in Wiki.
>
> We normally differentiate meanings like that using LSA. You appear not
> to have used it.
>
So what is LSA, and how does it differ from ABC?
> One last point: Google has its finger in a large number of pies. It
> owns YouTube. It has got a merchant facility for on line credit card
> transactions. It has Google maps, Google this Google that. Google is
> going to digitize all books. Clearly it has the clout. If you are
> going to start up a search engine you have to prove not only that it
> is a bit better than Google, but a lot better. Your best bet therefore
> might well be to approach Google or Microsoft. They will I can assure
> you be asking the same sort of questions I am asking.
>
I have gone to Seattle and "approached" Microsoft, but found that I
couldn't even get past the first secretary. Google is an entirely
different story, and so help me, I speak of Google with reverence because
of their great service to mankind. I sent an e-mail message to Google, but
they politely informed me that they were not interested in natural language
at the time.
I don't want to destroy Google, but it may be necessary to destroy Google
to get at Microsoft, and you can bet I would like to get at Microsoft, and
that for many reasons, my main complaint being their monopolistic attitude,
and in this I am far, far from alone.
> Sorry to be so negative.
>
>
> - Ian Parker
Were you being negative? I hadn't noticed that you were negative at all.
In fact I have enjoyed responding to your message very much.
I am quite aware that it will be impossible for one man to achieve my goals
alone, and this is why I have designed Witchit as a system upon which
thousands of people can cooperate from the beginning. I want a computer
that can provide direct, smart answers to any question--not just another
pointing, clicking machine. And I want this computer to be able to speak
ANY human language. Please stick with me and work with me, and together we
will realise this goal. But we will not stop there, because, as you may
have noticed, this is the pathway to true artificial intelligence.
--Joe Devin.
Density = Mass/Volume
You have to find Mass. OK somewhere it is said that it is the Mass of
the Sun (about 300,000 times the mass of the Earth). The volume is the
volume of the Earth. OK the calculation of density is then a trivial
one. I need now to know the density of the Earth.
This is quite a good question. If I say the Mass of the Sun, I need to
know what the mass of the Sun is. Now I may be under a misapprehension
but I thought that this was EXACTLY what you were claiming.
I say what is the mass of X. A – It is Y – What is Y? – It is
300,000Z. What is Z? It is about 7/Volume. Therefore density is
2,100,000.
This is a chain of reasoning that should be possible given a good
semantic model. If it actually calculates the volume of the Earth and
then finds the mass in tonnes OK, it is a little bit more cumbersome
though.
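The arithmetic in that chain can be written out in a few lines. This sketch uses the round figures from the post (300,000 Earth masses, Earth density taken as 7 g/cm^3) and assumes the white dwarf has exactly the volume of the Earth; the function name is made up.

```python
SUN_IN_EARTH_MASSES = 300_000   # round figure used above
EARTH_DENSITY = 7.0             # g/cm^3, round figure used above

def white_dwarf_density(mass_in_earths: float, earth_density: float) -> float:
    """Density of a body with the Earth's volume and the given mass.

    With the volume fixed at one Earth volume, density scales with mass alone.
    """
    return mass_in_earths * earth_density

density = white_dwarf_density(SUN_IN_EARTH_MASSES, EARTH_DENSITY)  # 2,100,000 g/cm^3
```

With the measured values (the Sun is about 333,000 Earth masses and the Earth's mean density is about 5.5 g/cm^3) the same chain gives roughly 1.8 million g/cm^3, so the order of magnitude does not change.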
The chain of reasoning is what a proof engine like Alcor will give
you. OK think of semantics as being an extension to Alcor. Think of
the sizes and masses of the Earth, the Sun and a White dwarf as being
“proofs” in the Mizar sense. THESE PROOFS ARE GENERATED BY THE
SEMANTIC ENGINE OPERATING ON (SAY) WIKI.
You are asking for tables. Your claim is that it can find data. Go out
and find them. “Come and get them” as Leonidas said to Xerxes. In fact
in the Main sequence
R = M^0.75 L = M^3.5 Hence T = Sqrt(M)
Somewhere there is this information.
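The step from the first two relations to the third goes through the Stefan Boltzmann law: in solar units L = R^2 * T^4, so T^4 = M^3.5 / M^1.5 = M^2 and T = Sqrt(M). A short numerical check follows, with all quantities in solar units and the exponents taken as stated above (they are rough scaling laws, not exact results).

```python
def radius(mass: float) -> float:
    return mass ** 0.75          # R = M^0.75

def luminosity(mass: float) -> float:
    return mass ** 3.5           # L = M^3.5

def temperature(mass: float) -> float:
    # Stefan Boltzmann in solar units: L = R^2 * T^4, solved for T.
    return (luminosity(mass) / radius(mass) ** 2) ** 0.25

blue_giant_luminosity = luminosity(20)    # roughly 36,000 times the Sun
blue_giant_temperature = temperature(20)  # sqrt(20), about 4.5 times the Sun
```

This also answers the earlier question about a blue giant of 20 solar masses, at least to the accuracy of the scaling law.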
I have the vision of being able to define the inputs to my program in
Natural Language and similarly the outputs. Based on NLP I should be
able to run a system of programs in a coordination language like
Manifold. Being able to do that would represent a major advance in
AGI.
A system should be able to cope with a variety of languages. Arabic,
Urdu and Chinese are NIST competition languages. As you seem to come
from the Far East you perhaps should concentrate on Chinese. However
the way to be noticed is to win NIST.
Google’s (and Microsoft’s) Arabic is extremely disappointing. It does
not look at truth in any way. It does not find out things.
- Ian Parker
Thanks for your continuing interest in this project (Witchit search
engine).
You wrote:
> I think I should give you a little bit more on where I am coming from.
> My vision is, if you like of having a system which can REASON.
"Reasoning" is very complex, and can work in various ways (can work by
means of dissimilar algorithms).
> If you postulate a good NLP system, it will come quite close to
> reasoning.
The most important parts of NLP are not reasoning but parsing (machine
understanding) and text generation (being able to generate text output from
internal representations). And contrary to the ideas many people entertain
about intelligence, artificial or otherwise, all other functions rest upon
these two functions and their associated internal data structures and
functions (algorithms). I have included the latter because very
intelligent animals appear to have many of the same internal data
structures and algorithms but lack the ability to parse coherent sentences
or generate them.
Only after these basics are in place (and they constitute the foundations
of human intelligence) can we begin to move on to various kinds of
reasoning.
> I feel I also should have added – “Produce your answers by
> means of a chain of reasoning that is recognizably human and NLP
> based.”
NLP is the door into and out of the human mind, be that mind grey matter or
silicon. Once through this door, no holds are barred. The human mind is
free to use any trick in the book to go after its results. But once these
results have been obtained, they must pass back through the same NLP door
in order to be easily understood.
> Let me go through one worked example - the density of material in a
> White Dwarf. Now
>
> Density = Mass/Volume
>
> You have to find Mass. OK somewhere it is said that it is the Mass of
> the Sun (about 300,000 times the mass of the Earth). The volume is the
> volume of the Earth. OK the calculation of density is then a trivial
> one. I need now to know the density of the Earth.
>
> This is quite a good question. If I say the Mass of the Sun, I need to
> know what the mass of the Sun is. Now I may be under a misapprehension
> but I thought that this was EXACTLY what you were claiming.
Not at all. We do not learn things like material density and the mass of
the sun, etc., from our mamas. We learn them (usually with great
difficulty) in college. Right now I am concerned only with the ability to
understand questions about physics. This, to my understanding, is the real
essence of NLP, not figuring out the actual physics problems. But once the
child mind of the computer can correctly PARSE (or UNDERSTAND) the query,
then it can simply pass the question on to the internal experts (dedicated
programs that know precisely what to do with questions of mass, volume,
density, etc.). Then, when these expert systems have obtained the answer,
they can easily output it on the computer screen.
So you and I are contemplating machines that are very different from each
other in the fundamentals of their architecture. You are envisioning a
machine whose whole function is part and parcel of its ability to handle
NLP, whereas I am envisioning a system that uses correct parsing (machine
understanding) as the gateway into the inner workings of the machine, which
we are free to program in any fashion whatsoever without the least regard
for NLP.
> I say what is the mass of X. A – It is Y – What is Y? – It is
> 300,000Z. What is Z? It is about 7/Volume. Therefore density is
> 2,100,000.
And I care nothing whatsoever for x, y, or z, so long as my machine can
parse the sentence correctly and pass these variables (arguments) on to the
appropriate event handler. Queries are associated with event codes. Thus
when a user asks a specific kind of question that the system recognizes,
the system generates an event code (an integer value identifying the kind
of query or command). Then the system enters a big "switch statement" with
this event code, jumps to the appropriate action for the event code, and
usually passes control to another program called the "event handler"
dedicated to this specific kind of question, along with the parameters
given (in natural language form) by the user.
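The flow from query to event code to event handler might be sketched as follows. Everything here is hypothetical: the event names, the tiny table and the handlers are made up, and the regular expressions merely stand in for the parser, which works quite differently. A dictionary of handlers plays the part of the big switch statement.

```python
import math
import re

EVENT_UNKNOWN, EVENT_CAPITAL, EVENT_SQUARE_ROOT = 0, 1, 2

# Stand-in for the parser: one pattern per recognized kind of query.
QUERY_FORMS = [
    (re.compile(r"what is the capital of (.+?)\?", re.IGNORECASE), EVENT_CAPITAL),
    (re.compile(r"what is sqr\((.+?)\)\?", re.IGNORECASE), EVENT_SQUARE_ROOT),
]

CAPITALS = {"panama": "Panama City", "japan": "Tokyo"}  # tiny illustrative table

def parse(query):
    """Return (event code, parameters) for a query."""
    for pattern, code in QUERY_FORMS:
        match = pattern.fullmatch(query.strip())
        if match:
            return code, match.groups()
    return EVENT_UNKNOWN, ()

def handle_capital(country):
    return CAPITALS.get(country.lower(), "I don't know that yet.")

def handle_square_root(argument):
    return str(math.sqrt(float(argument)))

# The "switch statement": event code -> event handler.
HANDLERS = {EVENT_CAPITAL: handle_capital, EVENT_SQUARE_ROOT: handle_square_root}

def respond(query):
    code, params = parse(query)
    handler = HANDLERS.get(code)
    if handler is None:
        return "I don't understand."
    return handler(*params)
```

Adding a new capability then means adding one query form, one event code and one handler, without touching the rest.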
> This is a chain of reasoning that should be possible given a good
> semantic model. If it actually calculates the volume of the Earth and
> then finds the mass in tonnes OK, it is a little bit more cumbersome
> though.
I think I consider semantics separate from calculation. Most children can
deal with good semantics long before they learn to calculate the answers to
physics questions that even they, themselves, might understand and pose.
> The chain of reasoning is what a proof engine like Alcor will give
> you.
Very well, then give me ALCOR, and I will simply incorporate it into my
larger system by making sure my system can correctly parse the kinds of
natural-language inputs that might be used by ALCOR and passing the
parameters on to Alcor for further processing.
> OK think of semantics as being an extension to Alcor.
Nada. Semantics comes first, and ALCOR and everything else must be based
on semantics, and not the other way round. To understand semantics, please
check out my writings on the ontology and those parts of my lectures
dealing with ontologies. Semantics is the study of meanings, and
ontologies are collections of meanings, and these collections of meanings
can be used by computers for various purposes, the most important of which
is probably parsing (understanding what is being written or spoken).
> Think of
> the sizes and masses of the Earth, the Sun and a White dwarf as being
> “proofs” in the Mizar sense. THESE PROOFS ARE GENERATED BY THE
> SEMANTIC ENGINE OPERATING ON (SAY) WIKI.
>
To examine the contents of Wikipedia, semantics would truly come into play,
first of all in parsing the words contained in sentences, and then in
determining what to do with the results of the parsing. However I am not
sure precisely what is meant by "semantic engine." Semantics are at the
foundation of NLP just as, maybe, trees are at the foundation of the
biosphere, yet people do not speak of "tree engines," at least not to my
knowledge. And yet all kinds of things can be done with trees.
> You are asking for tables. Your claim is that it can find data.
This is no "claim." All computer programs can find data, but some data is
easier to find than other data, and the point is to spend a minimum of time
to produce the maximum results. People creating the Wikipedias of the
future should at all times be asking themselves, "Exactly what question is
the information I am writing answering, and how would this question be
formulated by a real human being using real natural language?" If people
would do that instead of being obsessed with other, more pedantic aspects
of style, it would bring us a long way towards being able to get straight
answers from machines.
> Go out
> and find them. “Come and get them” as Leonidas said to Xerxes. In fact
> in the Main sequence
>
> R = M^0.75 L = M^3.5 Hence T = Sqrt(M)
>
> Somewhere there is this information.
Go tell THAT to your 5-year-old daughter, and, who knows, she may surprise
you. But then again she may not. Algorithms and calculations are an art,
and not part of the basic human linguistic apparatus. Understanding
questions and answers IS, provided the ontology of the individual includes
a recognition of the meanings involved.
Furthermore, unlike many of my peers, I have no confidence whatsoever in
any theory that sees mathematics as having anything to do with language.
Er, maybe boolean algebra. But many linguists would seem to have
deliberately tried to make their theorizings (and I will not call these
"theories" because they are unworthy of the name) look respectable by
hiding their own ignorance and complete absence of scientific rigor
behind a mask of mathematics.
> I have the vision of being able to define the inputs to my program in
> Natural Language and similarly the outputs. Based on NLP I should be
> able to run a system of programs in a coordination language like
> Manifold. Being able to do that would represent a major advance in
> AGI.
I fail to see what "Manifold" might have to do with any of this, and why
what you are talking about would constitute any advance since it is already
being done. Can you please set me straight on this point?
> A system should be able to cope with a variety of languages. Arabic,
> Urdu and Chinese are NIST competition languages. As you seem to come
> from the Far East you perhaps should concentrate on Chinese.
Unfortunately, although I think I am handsome enough, I am surely no fairy,
and therefore cannot add natural languages by the wave of a wand. But in
fact I do rather idolize Chinese girls because of their silky hair and
smooth skin, so I hope the Chinese will remember these things and be
merciful to me and not decapitate me for my anti-authoritarian rhetoric
when they finally take over America.
In fact Mandarin Chinese, as difficult as it sounds, is probably one of the
easiest languages on earth for computers. There are only about 1,600
possible word sounds in the language--so few that a smart native speaker
could probably record all of them in a couple of days. So if you know
Chinese, or have any Chinese friends who are interested, please let me
know. Even just to create a Chinese speech synthesizer with this minimal
effort might help millions of blind people in China, and I am eager to do
it for nothing in order to repay all the smiles I have received from
Chinese maidens during my lifetime--not to mention all the good Chinese
food I have received at their hands. Come to think of it, I do owe the
whole Chinese race!
> However
> the way to be noticed is to win NIST.
>
And pray tell what is NIST, and how does it differ from ABCD?
Meantime, Google will start noticing me soon enough if only I can get some
of you to cooperate--Which reminds me, I still have to save up some money
for that Glock. Do Chinese girls use Glocks?
> Google’s (and Microsoft’s) Arabic is extremely disappointing. It does
> not look at truth in any way. It does not find out things.
>
>
> - Ian Parker
Er, those who search for truth in Arabic may be easily misled. For THAT,
methinks I will stick to English. But if it is accurate translation you
are looking for, then what say you and I cooperate on Arabic? First we
will build an Arabic corpus and ontology, and then we will set up tables
and algorithms serving as a bridge from Arabian meanings to English and
vice versa. You can build the Arabian corpus and ontology quickly with my
program, Brainchild 5, and then we can work on setting up a translation
table between Arabic meanings and English equivalents. Then we will work
out the ad hoc algorithms that will be needed in addition. And once we
have done all of this, perhaps our leaders will start paying a little more
attention to what Moslems are saying before shocking and awing them into
subjection. And then, if their arguments are really completely
unreasonable, and they
really cannot be steered away from their religious fanaticism and their use
of religious fanaticism to sacrifice others in their selfish little wars,
why, we will simply nuke them--but better hurry, because this translation
machine of ours will work both ways, and they may learn things that it
would be better for Arabs and suchlike not to have known. Or what say you,
Herr Professor?
--Joe Devin.
Basically I am interested in what proof engines are doing. Mizar and
CYC/COG had two very different starting points. Mizar is saying "What
are the foundations of Mathematics? You can prove this? Is your proof
valid?" CYC/COG starts off assuming the knowledge of toddlers is a key
stage.
The end result is in fact that the systems are similar and when Mizar
"proves" a result it is not that different from a CYC deliberation.
The only difference being that Mizar will store its cogitations as
proof.
The surface area of the surface of a sphere I can assure you is within
Mizar. The interface is called Alcor and Alcor has the job of
retrieving information. Alcor should respond to 4piR^2. It should
recognize this as the surface area. In fact the surface volumes of
hyperspheres will be there too. As for the Stefan Boltzmann law, the
sound mathematical approach is in fact to start with Clifford Algebra.
This establishes Bose Einstein statistics (two photons can exist in
the same phase space). The derivation will be all there leading to a
4th power law.
There is one fundamental difficulty for me in what you are
saying. A toddler does not understand Clifford Algebra, it is not in
his world. However as soon as we start building an inference engine we
have all these concepts to hand. Why not simply use Mizar + Alcor?
Could Mizar/Alcor deal with non mathematical concepts? Yes and no. The
concepts we use have to be described mathematically.
Can we describe "settlements" say. These will occur a lot in UN
bilingual text. We can say that a settlement occupies a given amount
of land, has a number of roads and checkpoints associated with it etc.
etc. It could be encoded into Mizar as a graph. However everything
needs a formal description and must therefore be placeable in Mizar.
The barrier, to me, is in getting Mizar encodings from Natural Language.
That is not really the case. In fact in translating Chinese - >
English a few words is a distinct disadvantage. It would seem to me :-
1) That one Chinese word means a lot of English words. I have been
told that "chin" transliterated here means "gold" but it can mean a
lot of other things as well. huang chin (yellow gold) being usual. The
"huang" or "yang" "tse" being the yellow river.
2) LSA is the only way to get to grips with Chinese. For example, at
short range "huang" or "yang" must go to "gold" ("chin" being present).
The NIST results are as follows.
http://www.itl.nist.gov/iad/mig//tests/mt/2008/doc/mt08_official_results_v0.html
It can be seen that the bilingual pair matching (successful in part for
Arabic) breaks down, at least in part, for Chinese. I am discounting
Urdu in this. The sole interest in Urdu is in fact terrorist plots in
the tribal regions. Arabic and Chinese are both UN languages and
Google has (very roughly) the same training set.
Arabic is yellow and Chinese pink.
I have said for some time that the only effective way to translate is
using LSA. Let us do LSA on an English text. Now LSA has (so far) only
been done at one range. Let us have a number of ranges R(1), R(2),
R(3). We assume that R(1) is paragraph range, R(2) is 20 words either
side, while R(3) represents joined grammatical pairs (verbs/adverbs,
noun strings and adjectives). The beauty of LSA is that we are
performing a matrix diagonalisation; the fact that R(3) will also be
in R(2) and R(1) will therefore not matter. Let us get Chinese or Arabic
stems. We have a range of meanings, on our first pass we translate
them with the most probable meaning. We also look for bigrams, not the
sort of n-grams used by Google, but things like "yang chin". In fact
in Arabic "half qatar" is the expression for radius. In our second
pass we find the maximum probability in LSA terms, and we continue
with this process until we get convergence.
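The two-pass loop described above can be sketched roughly as follows. This is only a toy illustration under stated assumptions: the candidate glosses, the similarity numbers, and the names `similarity` and `translate` are all invented for the example, and a plain lookup table stands in for real LSA similarities.

```python
from typing import Dict, List, Tuple

# Assumed toy data: candidate glosses per stem, most probable first.
# "sense_b" and "sense_x" are placeholders, not real dictionary senses.
CANDIDATES: Dict[str, List[str]] = {
    "huang": ["yellow", "sense_b"],
    "chin": ["sense_x", "gold"],
}

# Assumed stand-in for LSA similarity between two English glosses.
SIMILARITY: Dict[Tuple[str, str], float] = {
    ("yellow", "gold"): 0.9,
    ("yellow", "sense_x"): 0.1,
    ("sense_b", "gold"): 0.2,
    ("sense_b", "sense_x"): 0.1,
}


def similarity(a: str, b: str) -> float:
    """Symmetric lookup; unknown pairs score 0."""
    return SIMILARITY.get((a, b), SIMILARITY.get((b, a), 0.0))


def translate(stems: List[str], max_passes: int = 10) -> List[str]:
    # First pass: take the most probable meaning of every stem.
    choice = [CANDIDATES[s][0] for s in stems]
    # Later passes: re-pick each gloss to agree best with its neighbours,
    # and stop as soon as a pass changes nothing (convergence).
    for _ in range(max_passes):
        new_choice = []
        for i, stem in enumerate(stems):
            def score(gloss: str, i: int = i) -> float:
                return sum(similarity(gloss, choice[j])
                           for j in range(len(stems)) if j != i)
            new_choice.append(max(CANDIDATES[stem], key=score))
        if new_choice == choice:
            break
        choice = new_choice
    return choice


print(translate(["huang", "chin"]))
```

All glosses are updated together from the previous pass, so `max_passes` is there as a guard in case the choices oscillate instead of settling.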
In English we have small words which are repeated a lot. These words
are stripped before we start LSA. We finish our translation by
matching small words to small words (Chinese) or inflexions to small
words (Arabic).
- Ian Parker
You write:
> You are making one assumption in what you are saying. This is an
> assumption which a lot of people are making, namely that we should
> progress though the "toddler" stage. Myself I disagree.
Please refrain from putting words into my mouth. I would never make such a
simpleminded assertion. If you will do me the honor of understanding what
I have said before replying, you will see that what I did was to clarify
the difference between our two approaches to AI. Your postings indicate
that you associate things like calculation with the linguistic process. I
demonstrated, and my software demonstrates, that these two components of
intelligence are separate, and that when we understand this, it frees us to
focus on each separately, which we are unable to do as long as we confuse
these two things in our minds and in our work. It is well known that big
unsolvable problems can be solved by breaking them down into little
solvable ones. And this is what I have shown you how to do, although you
seem to have failed to notice.
The problem of parsing is nontrivial, and as I wrote in a previous message,
it has been the holy grail of AI from the beginning. And it (the ability
to parse, for example, plain English sentences) entails a host of other
things that must be understood beforehand. I have solved this problem
(which was never solvable before due to the lack of rigor in previous
methods of analysis), and now I am giving you an opportunity to change the
world by taking advantage of what I have provided.
> Computer
> algorithms are pieces of mathematics and we have therefore to find a
> way of translating common sense knowledge into mathematical form.
Utter nonsense. If this were the least bit true, then my Aunt Ginger would
have been a mathematician.
>
> There may be a number of ways of doing this.
There is absolutely NO way of doing this. Where many physicists and
mathematicians have gone wrong is to assume that the universe is "governed"
by mathematical laws. Nada. The universe is the universe, mathematical
laws be damned. Mathematics does not govern ANYTHING: it is merely an
artifact of the human mind created to provide a means of quantifying and
(maybe) understanding observable phenomena. These phenomena are what they
are, and we describe them by means of mathematics, but they are in no way
"subject" to our laws.
> Basically I am interested in what proof engines are doing. Mizar and
> CYC/COG had two very different starting points. Mizar is saying "What
> are the foundations of Mathematics? You can prove this? Is your proof
> valid?" CYC/COG starts off assuming the knowledge of toddlers is a key
> stage.
Once again you are confusing mathematics with linguistics, and in order to
do this you need first to prove the relationship that you are assuming.
Several famous linguists have hidden their ignorance behind the skirts of
mathematics. I contend that mathematics may be okay for describing
physical phenomena, but that mathematics is utterly useless for describing
the inner workings of the human linguistic apparatus--unless maybe you are
talking about something like Boolean algebra. All linguistic phenomena are
based squarely upon the following nonmathematical theorem:
THEOREM: Every word in every coherent phrase of any language can be
completely defined as just a semantic link and a syntactic link emanating
from the same node.
This is the theorem that has been missed by every researcher and
philosopher since Panini and Aristotle, and this is why language has never
(until now) worked on computers. Mathematics has nothing to do with it
whatsoever, but a rigorous building upon this simple foundation DOES. When
people build upon a lame foundation, their structures crumble in the first
breeze, and all attempts at NLP will continue to crumble until this train
comes to a complete halt and people begin to recognize my theorem--because,
after all, this theorem IS mine in exactly the same way as Pythagoras'
theorem is Pythagoras'.
> The end result is in fact that the systems are similar and when Mizar
> "proves" a result it is not that different from a CYC deliberation.
Once again, geometry and mathematical proofs have nothing to do with
language. Little babies who know nothing of mathematics know language, so
Cyc and Doug Lenat and MIZAR and their cogitations mean nothing because
these systems do not know language. Are you capable of understanding what
I am saying or are YOU still playing the toddler? Phew, some people do try
my patience awfully badly!
> The only difference being that Mizar will store its cogitations as
> proof.
>
> The surface area of the surface of a sphere I can assure you is within
> Mizar.
Then God bless MIZAR and help him to digest his rubber ball!
> The interface is called Alcor and Alcor has the job of
> retrieving information. Alcor should respond to 4piR^2. It should
> recognize this as the surface area. In fact the surface volumes of
> hyperspheres will be there too.
I do not know what a hypersphere is, but I can write a natural-language
algorithm that will recognize your formula in a couple of minutes, so why
do I need ALCOR?
> As for the Stefan Boltzmann law, the
> sound mathematical approach is in fact to start with Clifford Algebra.
Why even start with algebra at all? My work is linguistics. I made As and
Bs in calculus and probability, but this has nothing to do with
linguistics. Yet my machines can do math that might make some people look
cross-eyed if only I know what input to look for and what output is
desired. Parsing is the province of NLP: mathematics is the province of
computation. I only need to know what arguments to pass to any function,
and the function can take it from there. The problem is to parse
natural-language sentences containing these arguments and to then pass
these arguments to the appropriate function in a form said function can
"understand," and that is what I am doing.
> This establishes Bose Einstein statistics (two photons can exist in
> the same phase space). The derivation will be all there leading to a
> 4th power law.
Tell this to my late Aunt Ginger and who knows, maybe her common sense will
kick in and she will explain what you are discussing, but first you are
going to have to resurrect her from the dead, and it may take somewhat
more than Bose and Einstein to do THAT!
Ha-ha, just kidding! Of course I must admire Einstein, and Bose too, but
all things have their limits, and linguistics is a region where it does
seem hard for math to go. Yet "weighting" is something. Weighting can be
used to determine the strengths of linguistic linkages, after which
calculations can be used for various purposes--for example when a machine
might safely be able to discard an unused word, or for fuzzy matching
purposes. Yep, that's IT, I do use math in fuzzy matching, so I am forced
to take a step back and apologize. Mathematics is/are useful even in
linguistics, but not in the ways most people think (including Chomsky, dear
Noam please forgive me, I couldn't resist that).
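For what it is worth, the kind of fuzzy matching mentioned here can be tried out with Python's standard `difflib` module. This is just a sketch and not the matcher used in Brainchild or Witchit; the helper name `closest_word`, the cutoff value, and the word list are assumptions for illustration.

```python
import difflib
from typing import List, Optional


def closest_word(word: str, vocabulary: List[str],
                 cutoff: float = 0.8) -> Optional[str]:
    """Return the best vocabulary match for `word`, or None if nothing
    reaches the similarity cutoff (a ratio between 0 and 1)."""
    matches = difflib.get_close_matches(word, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else None


vocabulary = ["resurrect", "respect", "result"]
print(closest_word("ressurrect", vocabulary))  # a misspelling close to a known word
print(closest_word("xyz", vocabulary))         # nothing similar enough
```

The similarity ratio plays the part of the "weight": a word is accepted only when its score clears the cutoff.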
> To me there is one fundamental difficulty in what you are
> saying.
You mean what YOU are saying. If I remember correctly, "toddler" is YOUR
word.
> A toddler does not understand Clifford Algebra, it is not in
> his world. However as soon as we start building an inference engine we
> have all these concepts to hand. Why not simply use Mizar + Alcor?
Please bring them on, I'll use them if they work and don't require more
than .5 of Hostmonster's resources!
> Could Mizar/Alcor deal with non-mathematical concepts? Yes and no. The
> concepts we use have to be described mathematically.
The word, "concept," has different meanings to different people. In
cognitive science, a "concept" is usually some particular meaning, for
example one of the meanings of "go" or "ham." Such a meaning is equivalent
to a word-sense definition in your dictionary. In my grand theory of
language (see http://panlingua.net), such a meaning is called a "semantic
node," or "semnod." An "ontology" is a collection of such semnods and the
important linkages between them, for example what IS a something else, what
is a part of something else, etc. The correct parsing of new and
unexpected sentences would be impossible without an ontology.
> Can we describe "settlements" say. These will occur a lot in UN
> bilingual text. We can say that a settlement occupies a given amount
> of land, has a number of roads and checkpoints associated with it etc.
> etc. It could be encoded into Mizar as a graph. However, everything
> needs a formal description and must therefore be placeable in Mizar.
The linguistic encoding of such information requires two kinds of data
structure, namely an ontology and a corpus of parsed sentences, where
"parsed" means having each word reduced to a single node from which emanate
a semantic and a syntactic link. The semantic link of each word is a link
to one of the semnods in the ontology. The syntactic link is the link to a
word's regent (another word in the same sentence) or else to nowhere in the
case of the top word of a sentence.
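As a rough sketch of the two structures just described (this is not the actual Brainchild code, which is not shown in this thread), each word can be held as one node carrying a semantic link and a syntactic link. The class names `Semnod` and `WordNode`, the field names, and the tiny ontology below are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Semnod:
    """One meaning in the ontology, with an optional IS-A link."""
    ident: str
    is_a: Optional[str] = None


@dataclass
class WordNode:
    """One parsed word: a single node with two links."""
    surface: str
    semnod: str             # semantic link: id of a semnod in the ontology
    regent: Optional[int]   # syntactic link: index of the regent word,
                            # or None for the top word of the sentence


# Toy ontology (assumed entries, for illustration only).
ontology: Dict[str, Semnod] = {
    "animal": Semnod("animal"),
    "dog": Semnod("dog", is_a="animal"),
    "bark": Semnod("bark"),
}

# Parsed form of "Dogs bark": "Dogs" depends on "bark", the top word.
sentence: List[WordNode] = [
    WordNode("Dogs", semnod="dog", regent=1),
    WordNode("bark", semnod="bark", regent=None),
]


def top_word(words: List[WordNode]) -> WordNode:
    """The one word whose syntactic link points nowhere."""
    tops = [w for w in words if w.regent is None]
    if len(tops) != 1:
        raise ValueError("a parsed sentence needs exactly one top word")
    return tops[0]


def is_a(ident: str, ancestor: str) -> bool:
    """Follow IS-A links upward through the ontology."""
    current: Optional[str] = ident
    while current is not None:
        if current == ancestor:
            return True
        current = ontology[current].is_a
    return False


print(top_word(sentence).surface)
print(is_a(sentence[0].semnod, "animal"))
```

The corpus would then simply be a list of such sentences, all pointing into the same ontology.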
> The barrier to me is in getting Mizar encodings from Natural Language.
>
Now we are back on common ground. You have a mathematical engine of some
kind named MIZAR which does some complex mathematical calculation or other,
and you need to find a way to extract input to MIZAR from plain-English
sentences. My system (Brainchild 5) can do that. Just tell me the generic
kinds of English queries you are expecting, and I will connect MIZAR to
Brainchild or to Witchit or whatever, and bingo, you will have your amazing
natural-language processing version of MIZAR! I am offering to put MIZAR
online and make it available to any scientist in the world who has a
computer and can formulate questions in plain English. This is the power
of Witchit and Brainchild.
Together we can make all of these things available to the world in plain
English right now. And at the risk of being called toddlers, we will
slowly be able to add search engines that can access the same power from
Arabic, Urdu, or Lower Slobbovian! This is the future of real search
engines--not pointing and clicking and remaining glued to your computer for
hours on end getting more and more of absolutely nothing. Sorry for being so
rude to the boys back at Google, but they had their chance.
--Joe Devin.
Thanks for your continuing interest in this project (Witchit search
engine).
You wrote:
> I think I should give you a little bit more on where I am coming from.
> My vision is, if you like of having a system which can REASON.
"Reasoning" is very complex, and can work in various ways (can work by
means of dissimilar algorithms).
> If you postulate a good NLP system, it will come quite close to
> reasoning.
The most important parts of NLP are not reasoning but parsing (machine
understanding) and text generation (being able to generate text output from
internal representations). And contrary to the ideas many people entertain
about intelligence, artificial or otherwise, all other functions rest upon
these two functions and their associated internal data structures and
functions (algorithms). I have included the latter because very
intelligent animals appear to have many of the same internal data
structures and algorithms but lack the ability to parse coherent sentences
or generate them.
Only after these basics are in place (and they constitute the foundations
of human intelligence) can we begin to move on to various kinds of
reasoning.
> I feel I also should have added – “Produce your answers by
> means of a chain of reasoning that is recognizably human and NLP
> based.”
NLP is the door into and out of the human mind, be that mind grey matter or
silicon. Once through this door, no holds are barred. The human mind is
free to use any trick in the book to go after its results. But once these
results have been obtained, they must pass back through the same NLP door
in order to be easily understood.
> Let me go through one worked example - the density of material in a
> White Dwarf. Now
>
> Density = Mass/Volume
>
> You have to find Mass. OK somewhere it is said that it is the Mass of
> the Sun (about 300,000 times the mass of the Earth). The volume is the
> volume of the Earth. OK the calculation of density is then a trivial
> one. I need now to know the density of the Earth.
>
> This is quite a good question. If I say the Mass of the Sun, I need to
> know what the mass of the Sun is. Now I may be under a misapprehension
> but I thought that this was EXACTLY what you were claiming.
Not at all. We do not learn things like material density and the mass of
the sun, etc., from our mamas. We learn them (usually with great
difficulty) in college. Right now I am concerned only with the ability to
understand questions about physics. This, to my understanding, is the real
essence of NLP, not figuring out the actual physics problems. But once the
child mind of the computer can correctly PARSE (or UNDERSTAND) the query,
then it can simply pass the question on to the internal experts (dedicated
programs that know precisely what to do with questions of mass, volume,
density, etc.). Then, when these expert systems have obtained the answer,
they can easily output it on the computer screen.
So you and I are contemplating machines that are very different from each
other in the fundamentals of their architecture. You are envisioning a
machine whose whole function is part and parcel of its ability to handle
NLP, whereas I am envisioning a system that uses correct parsing (machine
understanding) as the gateway into the inner workings of the machine, which
we are free to program in any fashion whatsoever without the least regard
for NLP.
> I say what is the mass of X. A – It is Y – What is Y? – It is
> 300,000Z. What is Z? It is about 7/Volume. Therefore density is
> 2,100,000.
And I care nothing whatsoever for x, y, or z, so long as my machine can
parse the sentence correctly and pass these variables (arguments) on to the
appropriate event handler. Queries are associated with event codes. Thus
when a user asks a specific kind of question that the system recognizes,
the system generates an event code (an integer value identifying the kind
of query or command). Then the system enters a big "switch statement" with
this event code, jumps to the appropriate action for the event code, and
usually passes control to another program called the "event handler"
dedicated to this specific kind of question, along with the parameters
given (in natural language form) by the user.
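To make the flow concrete, here is a minimal sketch of that event-code dispatch. It is an illustration only and not the Witchit source: the event codes, the regular expression, and the names `recognize`, `handle_density`, and `answer` are assumptions, and a dictionary lookup stands in for the big "switch statement."

```python
import re
from typing import Callable, Dict, Tuple

# Hypothetical event codes: integers identifying the kind of query.
EVENT_UNKNOWN = 0
EVENT_DENSITY = 1

# Assumed pattern for one kind of question; a real parser does far more.
DENSITY_PATTERN = re.compile(
    r"density\b.*?\bmass\s+(?P<mass>[\d.]+).*?\bvolume\s+(?P<volume>[\d.]+)",
    re.IGNORECASE,
)


def recognize(query: str) -> Tuple[int, Dict[str, str]]:
    """Return an event code plus the parameters exactly as the user typed them."""
    match = DENSITY_PATTERN.search(query)
    if match:
        return EVENT_DENSITY, match.groupdict()
    return EVENT_UNKNOWN, {}


def handle_density(params: Dict[str, str]) -> str:
    """Event handler: only here are the text parameters turned into numbers."""
    density = float(params["mass"]) / float(params["volume"])
    return f"The density is {density:g}."


def handle_unknown(params: Dict[str, str]) -> str:
    return "Sorry, I do not recognize that kind of question."


# The "switch statement": event code -> dedicated event handler.
HANDLERS: Dict[int, Callable[[Dict[str, str]], str]] = {
    EVENT_DENSITY: handle_density,
    EVENT_UNKNOWN: handle_unknown,
}


def answer(query: str) -> str:
    event_code, params = recognize(query)
    return HANDLERS[event_code](params)


print(answer("What is the density for mass 10 and volume 4?"))
print(answer("Who won the game?"))
```

The point of the split is that `recognize` knows nothing about physics and `handle_density` knows nothing about English.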
> This is a chain of reasoning that should be possible given a good
> semantic model. If it actually calculates the volume of the Earth and
> then finds the mass in tonnes OK, it is a little bit more cumbersome
> though.
I think I consider semantics separate from calculation. Most children can
deal with good semantics long before they learn to calculate the answers to
physics questions that even they, themselves, might understand and pose.
> The chain of reasoning is what a proof engine like Alcor will give
> you.
Very well, then give me ALCOR, and I will simply incorporate it into my
larger system by making sure my system can correctly parse the kinds of
natural-language inputs that might be used by ALCOR and passing the
parameters on to Alcor for further processing.
> OK think of semantics as being an extension to Alcor.
Nada. Semantics comes first, and ALCOR and everything else must be based
on semantics, and not the other way round. To understand semantics, please
check out my writings on the ontology and those parts of my lectures
dealing with ontologies. Semantics is the study of meanings, and
ontologies are collections of meanings, and these collections of meanings
can be used by computers for various purposes, the most important of which
is probably parsing (understanding what is being written or spoken).
> Think of
> the sizes and masses of the Earth, the Sun and a White dwarf as being
> “proofs” in the Mizar sense. THESE PROOFS ARE GENERATED BY THE
> SEMANTIC ENGINE OPERATING ON (SAY) WIKI.
>
To examine the contents of Wikipedia, semantics would truly come into play,
first of all in parsing the words contained in sentences, and then in
determining what to do with the results of the parsing. However I am not
sure precisely what is meant by "semantic engine." Semantics are at the
foundation of NLP just as, maybe, trees are at the foundation of the
biosphere, yet people do not speak of "tree engines," at least not to my
knowledge. And yet all kinds of things can be done with trees.
> You are asking for tables. Your claim is that it can find data.
This is no "claim." All computer programs can find data, but some data is
easier to find than other data, and the point is to spend a minimum of time
to produce the maximum results. People creating the Wikipedias of the
future should at all times be asking themselves, "Exactly what question is
the information I am writing answering, and how would this question be
formulated by a real human being using real natural language?" If people
would do that instead of being obsessed with other, more pedantic aspects
of style, it would bring us a long way towards being able to get straight
answers from machines.
> Go out
> and find them. “Come and get them” as Leonidas said to Xerxes. In fact
> in the Main sequence
>
> R = M^0.75 L = M^3.5 Hence T = Sqrt(M)
>
> Somewhere there is this information.
Go tell THAT to your 5-year-old daughter, and, who knows, she may surprise
you. But then again she may not. Algorithms and calculations are an art,
and not part of the basic human linguistic apparatus. Understanding
questions and answers IS, provided the ontology of the individual includes
a recognition of the meanings involved.
Furthermore, unlike many of my peers, I have no confidence whatsoever in
any theory that sees mathematics as having anything to do with language.
Er, maybe Boolean algebra. But many linguists would seem to have
deliberately tried to make their theorizings (and I will not call these
"theories" because they are unworthy of the name) look respectable by
hiding their own ignorance and complete absence of scientific rigor
behind a mask of mathematics.
> I have the vision of being able to define the inputs to my program in
> Natural Language and similarly the outputs. Based on NLP I should be
> able to run a system of programs in a coordination language like
> Manifold. Being able to do that would represent a major advance in
> AGI.
I fail to see what "Manifold" might have to do with any of this, and why
what you are talking about would constitute any advance since it is already
being done. Can you please set me straight on this point?
> A system should be able to cope with a variety of languages. Arabic,
> Urdu and Chinese are NIST competition languages. As you seem to come
> from the Far East you perhaps should concentrate on Chinese.
Unfortunately, although I think I am handsome enough, I am surely no fairy,
and therefore cannot add natural languages by the wave of a wand. But in
fact I do rather idolize Chinese girls because of their silky hair and
smooth skin, so I hope the Chinese will remember these things and be
merciful to me and not decapitate me for my anti-authoritarian rhetoric
when they finally take over America.
In fact Mandarin Chinese, as difficult as it sounds, is probably one of the
easiest languages on earth for computers. There are only about 1,600
possible word sounds in the language--so few that a smart native speaker
could probably record all of them in a couple of days. So if you know
Chinese, or have any Chinese friends who are interested, please let me
know. Even just to create a Chinese speech synthesizer with this minimal
effort might help millions of blind people in China, and I am eager to do
it for nothing in order to repay all the smiles I have received from
Chinese maidens during my lifetime--not to mention all the good Chinese
food I have received at their hands. Come to think of it, I do owe the
whole Chinese race!
> However
> the way to be noticed is to win NIST.
>
And pray tell what is NIST, and how does it differ from ABCD?
Meantime, Google will start noticing me soon enough if only I can get some
of you to cooperate--Which reminds me, I still have to save up some money
for that Glock. Do Chinese girls use Glocks?
> Google’s (and Microsoft’s) Arabic is extremely disappointing. It does
> not look at truth in any way. It does not find out things.
>
>
> - Ian Parker
Er, those who search for truth in Arabic may be easily misled. For THAT,
I am not as expert, as experienced, or as qualified as either Chaumont
(Joe) Devin, or as Ian Parker, however I believe you both have a great
deal to offer the field. I believe if you cooperate great things may
come of it, but if you cannot move past your different starting points
and terminology, and if you cannot separate your separate "holy grails"
from what can be jointly achieved in the "next few steps" then that
would be a great loss.
I guess I'm a pragmatist, I personally see more value in the next small
step than in the holy grail, but I believe the "right" small step will
lead us closer to that holy grail, whatever that may be.
Personally, I believe a rich semantic model is the key, and that most
traditional AI / NLP research has been preoccupied with syntactical
parsing and pattern recognition to the exclusion of "comprehension".
I think most researchers have focused on syntax only.
I think Ian has focussed on logic & semantics ahead of syntax.
I think Joe has focussed on semantics ahead of syntax.
I believe that semantics and knowledge representation are the key and
that syntax is a (solvable) distraction.
Like Joe I have also developed a test bench based around semantic
modelling, with more focus on thesaurus structures rather than
dictionaries, and little emphasis on parsing. However this is just a
personal experiment, I am not pushing its merits as such. I don't expect
my test bench to take over the world (so far).
Philosophically I have learnt much from Ian's contributions to this
forum. At the same time my personal test bench apparently has more
parallels with Joe's approach. Like Ian I have at least a basic
understanding of several languages including at least one truly
different language (Mandarin in my case), i.e. a foreign language which
has minimal overlap with your own native language. (e.g. I would not
count French as a truly foreign language since half of English is
borrowings from Old French during the Norman Conquest).
I believe Ian & Joe are both on "the right track" and wish you luck !
Aren't we old friends? It seems like I remember your name from somewhere
out of the past.
You write:
> Personally, I believe a rich semantic model is the key, and that most
> traditional AI / NLP research has been preoccupied with syntactical
> parsing and pattern recognition to the exclusion of "comprehension".
>
> I think most researchers have focused on syntax only.
Maybe the focus on "syntax only" is a phase that we all have to go through
in order to mature to the point where we can push on to deeper things.
Personally, it took me a long span from 1986-1994 to work through that
phase to my satisfaction. I think I always knew I would have to negotiate
semantics sooner or later, but I kept putting it off because it seemed so
horribly complex and there was such a great deal I could do with syntax.
Here in Hawaii, I once audited a class taught by a leading authority on
syntax and case roles, and he seemed to avoid semantics to the point of
paranoia.
As it turns out, language cannot work without the ability to deal with both
components (syntax and semantics) simultaneously, and this is because, as I
finally discovered to my own astonishment:
THEOREM: Every word in every coherent phrase of every natural language is
simply a syntactic link and a semantic link emanating from the same node
(for more see http://panlingua.net).
I discovered this on my own, partly by means of computer research and
partly by synthesizing what world authorities on linguistics had already
said. Some of them were a hair's breadth away from making this discovery,
but were somehow not quite able to listen to their own words carefully
enough to put it all together. As far as I have ever been able to
ascertain, from the time of Aristotle and Panini to my discovery, nobody
was ever able to see this. But now that I have nailed it, it forms the
cornerstone for a radically new understanding of linguistics and AI because
at last we have something concrete to build upon instead of just
conjecture. It is by examining the ramifications of this simple theorem
that we can trace a path through all of the inner workings of language and
completely uncover the semantic structure (the ontology) and the knowledge
structure consisting of words (the corpus) and see how they all fit
together.
And this is something like walking through a portal, or like "walking
through the looking glass" so to speak, because once we realize what words
really ARE, then we can shed their outward trappings (the written or spoken
pattern of characters or sounds) and focus upon their inner workings, all
of which can be modeled in terms of simple links and nodes that work
lightning fast on automated systems. And when we do this, suddenly we find
ourselves within a whole new world of discovery and possibility in
linguistics and AI.
> I think Ian has focussed on logic & semantics ahead of syntax.
But have you ever pondered the possibility that logic might not BE
semantics and vice versa?
> I think Joe has focussed on semantics ahead of syntax.
Yes, but that was between 1986 and 1994. From 1994 to my great
breakthrough in 2004, I focused on both. I wrote taggers and parsers, and
approached the problem of parsing from maybe 100 angles before I got it
right. But as I have already explained, getting it right means both
semantic and syntactic disambiguation at the same time with no "focus" on
the one or the other. They are simply "two sides of the same coin" in
layman's terms. So these two kinds of disambiguation have to be 100%
integrated and going on at the same time in order to parse natural
language.
> I believe that semantics and knowledge representation are the key and
> that syntax is a (solvable) distraction.
No, no, no, no! Syntax and semantics are inextricably bound together, and
it is the artificial separation of these two that has confused people and
kept them back. Please go back to my theorem. Here we are dealing with
rigor and not with idle conjecture, and we simply HAVE to understand what
is going on in order to make things work. The good news is that it is so
simple, if we are only willing to stop for a moment and ponder. Read what
I have written carefully, because it is basically very simple and yet of
crucial importance. You are not really reading what I write, hence your
continuing confusion on these points, which should not exist at all.
> Like Joe I have also developed a test bench based around semantic
> modelling, with more focus on thesaurus structures rather than
> dictionaries, and little emphasis on parsing. However this is just a
> personal experiment, I am not pushing its merits as such. I don't expect
> my test bench to take over the world (so far).
But I do expect this simple theorem of mine to take over linguistics and AI
because it constitutes a major discovery which is being deliberately
ignored and has been deliberately ignored for years, just like Galileo and
his telescope, whereas the truth is that AI and linguistics are really at a
standstill and cannot move ahead until people recognize this theorem just
like our knowledge of the Heavens was at a standstill under the cardinals
who refused to look through Galileo's 'scope!
Please consider the facts. It would be ridiculous for masons to try to
build brick walls without knowing what bricks are, and yet linguists and AI
"experts" keep trying to build AI without really knowing what words are.
Just ask any five linguists what a word is and you are apt to get ten
answers. This is NOT the way for people to do science.
> Philosophically I have learnt much from Ian's contributions to this
> forum.
And what about mine? Or have you forgotten those? Go look them up.
People have preserved them all over the web.
> At the same time my personal test bench apparently has more
> parallels with Joe's approach. Like Ian I have at least a basic
> understanding of several languages including at least one truly
> different language (Mandarin in my case), i.e. a foreign language which
> has minimal overlap with your own native language. (e.g. I would not
> count French as a truly foreign language since half of English is
> borrowings from Old French during the Norman Conquest).
>
I have a speech synthesizer running in Mandarin, but it still lacks certain
sounds. Yet Mandarin is one of the easiest languages in the world for
computers because it only has about 1,600 word sounds, and thus can be
synthesized completely using only 1,600 recorded sounds versus my 70,000+
for English. If I had a Chinese informant, and could get at all of these
speech sounds, which should only take a few days, I would be able to set up
a Mandarin version of my free text editor and speech synthesizer for the
blind. The blind association with which I have been cooperating in Vietnam
claims 50,000 members. Just imagine how many more blind people could use
help with Mandarin. This is a tremendous human resource that is being held
back and is just waiting to be developed. To me it is tragic that the most
elementary needs of so many millions of people are getting ignored while
developers here in America keep shamelessly making money off the blind.
Shame, shame, shame!
> I believe Ian & Joe are both on "the right track" and wish you luck !
Wow, thanks for not wishing me any more hard work because I think I may be
getting too old for this and may conk out before the realization of my
dream, which is, of course, none other than Cyberwoman, who completely
understands the male apparatus, never cheats, never lies, and knows how to
parse through any kind of astrophysical jargon faster than you or I could
chew celery--even with our dentures!
--Joe Devin.
Not that I recall, but best wishes anyway. I'm in Sydney, I gather
you're in Hawaii.
I didn't intend to be critical, quite the opposite.
I feel there are potential synergies between your ideas and those of Ian
Parker, and others like Mok-Kong Shen. It's not about who is "right" but
more about pooling ideas towards a common goal.
I am very much in the shadow of people like you & Ian, but I'm learning
a lot of different perspectives from reading the forum. I'll read up on
your panlingua papers.
cheers,
Brian
Could I first say a few words about myself. I am a retired scientist
with a strong interest in AI and AGI.
http://sites.google.com/site/aitranslationproject/
My aim in posting is to try to stimulate interest and also to avoid
misconceptions. The amount I can do myself is rather limited; a lot of
what I propose needs quite substantial resources. Not all of it,
though. One thing I do propose to do is to take Wikipedia (the text
used for the Hutter Prize), eliminate all the hyperlinks, and try to
determine the sort of compression which would be possible using
techniques based on LSA.
There is one thing about Joe Devin that disturbs me somewhat. What he
is proposing is NOT a new theory of language, it is in fact a mixture
of rehash of previous ideas and ideas which are in point of fact
included in the concept of LSA. Could I start with some theorems about
Matrix Algebra.
1) If you have identical rows/columns the determinant of the matrix is
zero.
2) If you take a number of ranges and place them in LSA, this includes
words immediately preceding/following a word (noun-adjective strings,
for example); you need not do anything else.
On “2” we may ask: why not? Surely if I take a range of +/- 50 words,
say, a word following another word is going to be within 50 words? Yes,
but remember we need not take A-B in our rows/columns. After all,
Cholesky’s method, the standard method for matrix diagonalization, does
this for us. Cholesky reduces a matrix to tridiagonal form and the
eigenvalues are then found iteratively.
The basis of LSA is to take the eigenvectors and truncate them. By
doing so we implicitly eliminate synonyms. Can you see why? A true
synonym will produce identical rows/columns and therefore a null
eigenvalue.
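Theorem 1 is easy to check by hand or by machine. What follows is only a minimal sketch in plain Python, assuming a toy 3x3 word-by-context count matrix; the words and counts are invented for illustration and come from no real corpus. Two "true synonyms" are given identical rows, and the determinant comes out as zero, which is the same thing as saying the matrix has a null eigenvalue.

```python
from fractions import Fraction

def determinant(matrix):
    """Determinant by Gaussian elimination with exact rational arithmetic."""
    rows = [[Fraction(x) for x in row] for row in matrix]
    n = len(rows)
    det = Fraction(1)
    for col in range(n):
        # Look for a row at or below the diagonal with a non-zero entry here.
        pivot = next((r for r in range(col, n) if rows[r][col] != 0), None)
        if pivot is None:
            return Fraction(0)  # no pivot available: the matrix is singular
        if pivot != col:
            rows[col], rows[pivot] = rows[pivot], rows[col]
            det = -det  # a row swap flips the sign of the determinant
        det *= rows[col][col]
        for r in range(col + 1, n):
            factor = rows[r][col] / rows[col][col]
            rows[r] = [a - factor * b for a, b in zip(rows[r], rows[col])]
    return det

# Hypothetical counts: rows are words, columns are contexts.
counts = {
    "sofa":  [4, 1, 0],
    "couch": [4, 1, 0],  # a "true synonym": identical to the row for "sofa"
    "star":  [0, 2, 5],
}
with_synonyms = determinant(list(counts.values()))
distinct_rows = determinant([[4, 1, 0], [1, 3, 0], [0, 2, 5]])
```

Real LSA works on a singular value decomposition of a much larger, rectangular matrix; the square toy case is used here only because the determinant is defined for it.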
I think that what is much more to the point is taking dictionary
n-grams (I say DICTIONARY n-grams, not the more generalised n-grams of
Google) and giving them a separate identity. The next stage is to do
clustering (k-means) on these words.
Let me give some examples. Blue Giant is a music group, Red Dwarf is a
comedy SF programme on television. Their vector values in a cluster
will be very different from that of stars. K-means will have no
difficulty.
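The first step, giving dictionary n-grams a separate identity, might be sketched as a greedy longest-match tokenizer. This is a sketch under stated assumptions: the tiny lexicon and the name `tokenize` are hypothetical, and the k-means clustering stage is not shown.

```python
# Hypothetical mini-lexicon of multi-word expressions ("dictionary n-grams").
DICTIONARY_NGRAMS = {
    ("black", "hole"),
    ("red", "dwarf"),
    ("blue", "giant"),
    ("neutron", "star"),
}
MAX_N = max(len(entry) for entry in DICTIONARY_NGRAMS)

def tokenize(text):
    """Greedy longest match: prefer a dictionary n-gram over single words."""
    words = text.lower().split()
    tokens, i = [], 0
    while i < len(words):
        for n in range(MAX_N, 1, -1):
            candidate = tuple(words[i:i + n])
            if candidate in DICTIONARY_NGRAMS:
                tokens.append(" ".join(candidate))  # one token for the whole expression
                i += n
                break
        else:
            tokens.append(words[i])  # no expression starts here: keep the single word
            i += 1
    return tokens

tokens = tokenize("A red dwarf is smaller than a blue giant")
```

Each multi-word token would then get its own row in the matrix, so that "red dwarf" is kept apart from both "red" and "dwarf".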
Having done k-means I will now have a set of unambiguous concepts,
which I can use in terms of truth. I will in fact have 2 kinds of
truth.
1) Truth like the Stefan Boltzmann law or surface area of a sphere
2) Truth that has been ascertained in the current document/
conversation.
“2” is very like “chat”. My name is Ian. The computer says “Ian .....”
I do have a model of language; I don’t know how original it is. Let us
take an expression. Suppose I say (French) “Je m’appelle Ian”. I can now
rewrite that sentence as “je m’appelle {Proper Name}”. Now I can treat
{Proper Name} as a single entity for purposes of LSA or other sorts of
linguistic analysis. If I know who is talking - who “je” is - I might be
able to fill it in, but I assume I can use a set of a large number of
possible proper names. Suppose I say {Personal Pronoun}-{appell}
{Proper Name}.
You can perhaps see a little bit of an Arabic mindset here. Most
French language teachers quote the infinitive “appeler”; I am quoting
the stem. The “-“ says that the personal pronoun has to agree with the
inflexion on “appell”. I can also say {Country[poss]} troops killed
Karen villagers.
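The {Proper Name} rewriting can be sketched in a few lines. This is only an illustration: the name list and the function `generalise` are hypothetical, and the agreement check that the "-" stands for is not attempted.

```python
import re

# Hypothetical list of known proper names; a real system would need a far larger one.
PROPER_NAMES = {"Ian", "Joe", "Brian"}

def generalise(sentence):
    """Replace each known proper name with the slot marker {Proper Name}."""
    def swap(match):
        word = match.group(0)
        return "{Proper Name}" if word in PROPER_NAMES else word
    return re.sub(r"[A-Za-z]+", swap, sentence)

template = generalise("Je m'appelle Ian")
```

Every sentence that differs only in the name then collapses onto the same template, which can be counted or looked up as a single entity.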
I can detect this in a text in 2 ways. I can go through all the
permutations of “je m’appelle Ian” and put all these things in a Hash
Table. This is the way Google translate works, and is a possible
method if you have terabytes on your server. BTW – You have to meet
the permutations in actual text, Google will not work out anything
generic from statements like those I have given. However apart from my
stem convention, that was the way I was taught French at school.
These examples are all areas where Google has fallen down. The
statement that US troops committed atrocities in the Far East has
damaged Google’s reputation. Conspiracy theorists will see it as
Google toadying up to dictators. In fact it is simply because they use
brute force with quite a lot of ignorance.
Dictionaries of expressions are found in every dictionary monolingual
and bilingual. The difficulty is the amount of work needed to extract
them and encode them in this form.
I did in fact try out Witchit. I asked what a “black hole” was. It
simply gave me the dictionary definition for “hole”. It did not look
it up as a bigram. To me this is unsatisfactory. What we need is a set
of expressions. OK, these are specialised words, although “trou noir”
is in the dictionnaire de la langue française. I also tried “pot hole”, a
fairly ordinary expression; again no go.
Can Google be beaten? With the best will in the world this cannot beat
a Google that has been giving correct n-grams for yonks. It should be
remembered too that Google is a vast machine with a great many
products. Despite my strictures on Arabic I have to admit that Google
are handling large distributed databases seamlessly. Compare this to
the British Government that has squandered billions on database
systems that don’t work. If you look say at NHS records Google stands
out as a shining success. Anyone who challenges Google will have to
take all this on board.
There are other models for large databases. There is the Ocean Store
http://oceanstore.cs.berkeley.edu/ model. This is in some respects
slightly better than Google. It allows every PC to become a server,
it allows you to construct virtual supercomputers, something which
Google as it stands will not do. It features multiple redundancy of
data. However Google is a tried and tested system.
It should also be pointed out that a competitor to Google not only has
to be better than Google, but very much better. Ocean Store is a lot
more logical and would allow for a search engine to be split up into
basic operating system and other areas. The long term effect Google is
having on competition does worry me.
- Ian Parker
Thanks for your criticism. I will attempt to respond as follows:
First of all, I have nothing personal against Google, but intellectually I
do have a problem with systems like Google because they fail to provide
much information, but instead only pass your question off onto somebody
else to answer. In my opinion pointing to other sources, resources, etc.,
is not exactly the way to solve the information problem, especially when
those sources/resources to which the buck is passed are mostly
advertisement and have to be re-edited by the user in an attempt to
ascertain what is really valid and what is not. This kind of system is
simply too slow for the future modern world. People do not need to have
yet one more thing tying their behinds to their computers and their mouses.
They need to be able to ask a sane question, get a sane answer, and move on
to other matters.
But having said all of this, once again I have nothing personal against
Google, but instead hold Google in some awe for its great service to
humanity--including myself.
What I am really after is Microsoft, because Microsoft has cost me a lot
of heartache and kept me from certain important long-term goals, and if I
can live long enough, I hope to get them for that if I can. However to get
at Microsoft, it may be necessary first to bury Google, which I really feel
reluctant to do, but may have to. The reasons for this are complex, and
not relevant to our current topic, so I will leave them alone for later.
Now at the beginning of all this I stated clearly that Witchit was still in
the first stages, experimental, etc., and I have asked for cooperation to
build it, because it is clear that although I may have the direction right,
I will not be able to gain much momentum on my own. But instead of
offering me cooperation, you have gotten Witchit into your gunsights and
plan to shoot it down. Why you would do this it is hard for someone like
myself to imagine, but there it is. Reality staring me straight in the
face, once again those who should be working with me trying to shoot me
down for no apparent reason. But please don't cry, I am well used to this,
having been through it over and over before. And if there is any way of
digging up all those old postings of the past, you will see how many people
got themselves singed by trying to win flame wars with, er, yes, ahem, with
ME. The reason is that although I be an emotional being just like
everybody else, I never let my emotions becloud my reasoning in the same
way others do. No matter how hateful and bitter my enemy, I will be the
first to acknowledge his/her success whenever he/she arrives at any truth,
whereas my opponents tend never to concede anything, and thus quickly
entangle themselves in their own confusions which they themselves have
created.
So while we are on this subject of confusion, I would like to remind you of
the principle of Ockham's razor. Why drag complicated mathematics into the
field of linguistics (including eigenvectors) if they offer no clear
advantage and don't fit all of the facts? What can you do with eigenvectors
that I cannot easily achieve by simpler means? Are you not a bit confused?
And now, having rebuked you as far as I deem appropriate for this time, I
will continue to respond:
Ian Parker wrote:
> There is one thing about Joe Devin that disturbs me somewhat. What he
> is proposing is NOT a new theory of language,
It most certainly IS, why else the rejection?
> it is in fact a mixture
> of rehash of previous ideas
Too bad Herr Albert Einstein is no more around, else you might say the
exact same thing to him, and perhaps deflate his ego a little!
If you will bother to read Samuel Clemens (who seems to have been somewhat
ahead of his time), he once said that William Shakespeare NEVER CREATED
ANYTHING. If you could have put him on a desert island, continueth Herr
Clemens, he never would have written anything. All he did was to
synthesize the things he already knew.
And of course I am quite guilty of the same, which you would have known by
my own admission if you had only bothered to read my
http://panlingua.net/txt/sources.txt, in which I make a clean breast of
everything.
> and ideas which are in point of fact
> included in the concept of LSA.
Er, I am afraid not.
> Could I start with some theorems about
> Matrix Algebra.
Surely, but there you must end, because you have no understanding of
linguistics even if your algebra doth shine.
> I did in fact try out Witchit. I asked what a “black hole” was. It
> simply gave me the dictionary definition for “hole”. It did not look
> it up as a bigram.
I think you are implying that Witchit is incapable of handling compound
words. Utter nonsense. The reason Witchit couldn't answer your question
correctly is simply because no one has ever bothered to add "black hole" to
the lexicon, which, as I have already said, is currently working on less
than 20,000 words. Ask yer 5-year-old daughter what a black hole is in the
shower, and she will probably look between her legs. Really, I am shocked
at this kind of petty shootemup attitude used upon someone who has
approached you in good faith for cooperation in his lonely efforts to bury
Google.
> To me this is unsatisfactory.
Of course it is unsatisfactory. Why do you think I would go to the trouble
of posting requests for cooperation if I had already buried Google? Like
Samuel Clemens, dear Professor, you are running ahead of your time. But
give me time, and a tidbit of cooperation instead of streambed gravel in my
mouth, and I will show you great things.
> What we need is a set
> of expressions.
What I and the world need is a set of straight answers, and I am going
after them on behalf of the world, but at this rate may be long dead before
I get even a few of them! So is this the age of information, or shall we
just be truthful and call it the age of more lies and disinformation? Old
lies, mind you, dusted off, fixed up, and once again shining!
> Can Google be beaten? With the best will in the world this cannot beat
> a Google that has been giving correct n-grams for yonks.
May Google continue to spew correct n-grams while I go after knowledge.
> It should be
> remembered too that Google is a vast machine with a great many
> products.
True, and this is probably the reason why I haven't buried Google--yet.
Would you agree, Herr Doctor? But even great rock faces can be undercut by
nothing more than waves, and I believe that natural language is the wave of
the future, and this seems to be the belief that has been causing all these
waves.
> It should also be pointed out that a competitor to Google not only has
> to be better than Google, but very much better.
Anything that could give straight answers instead of yet more links would
be much better than Google, no Herr Professor?
> Ocean Store is a lot
> more logical and would allow for a search engine to be split up into
> basic operating system and other areas. The long term effect Google is
> having on competition does worry me.
But not me, and I am sorry Google will have to be buried unless Google
comes around. Heck, I gave them a chance a long time ago, and they
wouldn't come on board, so now it's going to be all up to them. But if we
are going to bury Google (which I really do not want to do but may have
to), we are going to have to stop squabbling and work quickly. You want to
post about math on comp.ai.nat-lang, which is about computational
linguistics. I am looking for people who really believe in the power of
natural language and wish to put it to work NOW, and not 50 or 500 years in
the future.
Sincerely,
Chaumont (Joe) Devin.
I think you need to look at bigrams. BTW - You know the story of
Salome "half of my kingdom". What is half of Qatar?
String=nSf
half
middle
semi-
justice
we arrange[Imperf]
we classify[Imperf]
we be pure[Imperf]
we clarify[Imperf]
we purify[Imperf]
we liquidate[Imperf]
we choose[Imperf]
we prefer[Imperf]
we describe[Imperf]
we characterize[Imperf]
14 entries
String=qTrh
you[sm] trickle him/it [Perf]
you[sm] drip him/it [Perf]
you[sm] make drip him/it [Perf]
you[sm] make trickle him/it [Perf]
Qatars[m,p] *****************
drippings[m,p]
tricklings[m,p]
drops[m,p]
trains[m,p]
trains[m,p]
regions[m,p]
districts[m,p]
bigram - radius ****************
12 entries
Interesting one! Can have a lot of meanings that only LSA can
disambiguate. This is an example of a compound word in Arabic. Think,
what if Salome was offered the radius of the table she was dancing on!
Again the transliteration is Buckwalter. Arabic script will not go into
this usergroup anyway.
I feel that anyone who goes into language will have to work broadly on
the lines I am suggesting. It is a goal that will involve a lot of
work and I am beginning to doubt whether an individual can do it. It
would help if we had a dictionary we could go through.
I never suggested you had got anything against Google. The thing I
have against Google is a very generalized anti trust view. Google may
not (yet) have abused their power, although as you rightly say
Microsoft certainly has. The problem for regulators, and the citizen,
is what to do with a situation where a monopoly is natural. Do you
nationalize? Not if there is any alternative. Bing, the Microsoft
engine, is stirring Google into action; this can be nothing but a good
thing. Mind, I would prefer the competitor NOT to be Microsoft, but you
can't have everything.
I don't know whether we should be discussing how to get support etc.
in this usergroup, but in view of my assertion that this is too big
for one person perhaps I should. Some time ago I read an article at
http://www.zawya.com/story.cfm/ZAWYA20090829114306#commentB090831114847
It was about Arab women in Science and Technology. I made a comment to
the effect that the problem was to get Arab scientists of either sex.
I have also been looking at the entries (Arabic) in the NIST
competition and I am rather disappointed to see so few entries from
the Arab world. I feel a research project into linguistics would have
obvious attractions. I feel the Arabs should be the primary custodians
of their language. OK, we look at Gulf Arabic.
The other attractions are that a project of this type could be done at
a number of levels. As a project students could look at n-grams in a
particular area. At a higher level we would have LSA and k-means.
Just a thought.
- Ian Parker
> I think you need to look at bigrams. BTW - You know the story of
> Salome "half of my kingdom". What is half of Qatar?
I'm not quite sure. Tar maybe? I hear they have a lot of that down there.
> we liquidate[Imperf]
Oops! This is what keeps bothering me about things Arab and Arabic.
Please consider somebody else for the time being in order to give my
arthritis a chance to get a little worse, then you can have me.
> Interesting one! Can have a lot of meanings that only LSA can
> disambiguate.
So this would seem to mean all Arabs must have built-in LSA. Pretty scary.
So where do we go from here--if not into the bosom of Allah?
And re Salome? If she was really as beautiful as king Herod thought she
was, then I would settle for her right now, even if she only had one leg or
one whatever! Pure ecstasy, I say!
> The problem for regulators, and the citizen,
> is what to do with a situation where a monopoly is natural. Do you
> nationalize? Not if there is any alternative.
So we make an alternative.
> It was about Arab women in Science and Technology.
Scarce as hens' teeth! And they deliberately cut off their poor clits when
they are little girls to make them useless in bed.
But I once saw a stewardess back on Mideast Airlines who looked awfully
useful. In fact I was only 15, and she was so beautiful that I was scared
to look at her except to steal a few glances. No wonder the poor Arabs
have to cover up their faces, keep them off the streets, etc. With eyes
like that looking at them, it is truly a wonder that ANY men can get any
science done at all, even though they keep them covered almost all the
time. It's seeing them uncovered that sticks in the ontology, tortures the
Hell out of you day and night, and finally drives you on to suicide and
terror. Instead of fighting these poor arabs, we should really just trade
women with them for awhile in order to promote some better understanding.
However it might be difficult to make them fit into high heels without
abundant prayers to Allah and a Roget's Collegiate Thesaurus, not to
mention an annotated interlinear translation of the ancient texts. It's
really no wonder at all that Mohammed was a rapist and went after
9-year-old girls when you stop to consider the confusion wrought by Allah
in his mind. The problem is that I will never know whether all of them are
as stunning as that one I saw on Mideast Airlines because they keep the
rest of them all covered. Boy could I be in for some surprises, and not
all positive, as you might well imagine. If only there were some way.
BTW, is it true that the American military has perfected an undercover
camera that can see right through the robes? Now THAT would be some secret
weapon! But the adjustment would have to be a little delicate so as only
to penetrate the outer garments and not to penetrate the skin. But knowing
some of these old American army sergeants (Americans always think more is
better), they would turn the volume up so high that the dern thing would
penetrate straight into the bones, and all they would be left with would be
a study of female osteoanatomy. But who knows, some US army sergeants
might actually like this, just like they like to turn their boom boxes up
so high that the paper starts to tear right off their speakers. More is
better, joy oh joy!
And what, you ask, about the creatures themselves? Well, if I had a body
(not to mention a pair of eyes) like that stewardess back on Mideast
Airlines, it would confuse me so badly that I would never again be able to
do any science at all for the rest of my life (shudder, shudder). All I
would be able to do henceforwards would be just to quote verses from the
Koran and reproduce, reproduce, reproduce! And maybe this is why you are
not seeing Arab Women in the NIS, er no, I mean the NLA, or was it the ITJ?
Anyhow, you must know what I mean.
--Joe.
No, this is an intermediate stage. "you will liquidate" is simply a
verb with its persons described. Let us take an example in Latin
Gaius Mariam amat - Gaius loves Maria. Now "amat" in this form would
be "he loves" he being Gaius. In the actual code this is described by
bits.
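I cannot show the actual bit layout here, but the idea can be sketched with Python's `enum.Flag`. The layout, the three-entry lexicon, and the function name below are all assumptions made up for illustration.

```python
from enum import Flag, auto

class Feature(Flag):
    """Hypothetical bit layout: one bit per grammatical feature."""
    FIRST = auto()
    SECOND = auto()
    THIRD = auto()
    SINGULAR = auto()
    PLURAL = auto()
    PRESENT = auto()

# Hypothetical lexicon entries: surface form -> (gloss of the stem, feature bits).
FORMS = {
    "amo":   ("love", Feature.FIRST | Feature.SINGULAR | Feature.PRESENT),
    "amat":  ("love", Feature.THIRD | Feature.SINGULAR | Feature.PRESENT),
    "amant": ("love", Feature.THIRD | Feature.PLURAL | Feature.PRESENT),
}

def takes_third_singular_subject(form):
    """True when the verb form agrees with a subject such as Gaius."""
    _, bits = FORMS[form]
    wanted = Feature.THIRD | Feature.SINGULAR
    return (bits & wanted) == wanted

gaius_amat = takes_third_singular_subject("amat")
```

The gloss "he loves" is then just the stem gloss combined with whatever the person and number bits say.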
The important thing to realize is that we have a bigram here.
I know what you are saying. Statistics tell us that the Islamic world
taken as a whole produces far fewer scientific papers than Israel.
Islamic populations have a lower standard of living than non Muslims
where communities are mixed as they are in the Far East, and also
Western countries.
At the end of the Second World War Jews arrived in the Middle East
with just the clothes they stood up in. Today Israel is far more
productive than any Arab country, but it should be pointed out that it
did not start up that way. Israel started off as very much the poor
relation of the Arab world. The Arabs are quick to condemn Israel, but
they seem blind to the faults in their own society and ideology.
However when the Arab world acknowledges that it lags behind
scientifically I feel they should be supported, although obviously not
non critically. In Israel you are encouraged all the time to reach
your potential. This must also be the case in the Arab Middle East. I
feel two things about Haya bint Al Hussein. The first thing is the
enormous difficulties, cultural and religious, of achieving what she
wants, or appears to want. The second thing is that you cannot simply
have more women in science and expect the rest of culture to remain
the same. The culture of boardrooms (in the West) that have got women
on them is different from those who have not.
I suspect though that things have not been thought through. Let us
look again at
http://docs.google.com/Doc?docid=0AQIg8QuzTONQZGZxenF2NnNfNzY4ZDRxcnJ0aHI&hl=en_GB
He tells us that he is going to look at the evolution of stars, why
there are giants and white dwarfs but does not. Has someone “got at”
him? It came from a student group in Damascus. Syria is in fact one of
the more liberal countries about that sort of thing. When I had a
holiday in Syria though I rather put my foot in it when I asked
whether people really believed that Adam and Eve were real people.
They do implicitly, even more fervently than they do in the US.
Basically I am interested and sceptical at the same time. You cannot
train people scientifically if they feel unable to ask questions. I
suppose you actually have to be a scientist yourself to have gone to
that point. I do not expect that anyone at the conferences that she
sponsors will ask these types of question. Dawkins seems to me to be
the only person who ever talks about real issues.
I must say I do not like the idea of having to square things with an
"infallible" book. Although the 8 days of creation are epochs, and the
solar system arose from "dukhan" which means smoke but can mean gas or
vapour. Muhammad got the correct theory of the solar system, although
more by luck than judgement.
- Ian Parker
I find your experiences with the Arabs most fascinating, and now wish I had
had more time in the Middle East in order to better understand them and
perhaps kidnap a few young girls in order to save them from certain
clitorectomy. God knows the ones I have seen were good-looking enough, and
I have a natural instinct to love and protect whole harems, but unlike my
godly Arabian brothers, I insist upon the unclitorectomicised, or raw,
version.
I am not exactly sure what you mean by a bigram, however it would appear
that you are getting at the morphological and syntactic agreement required
by many languages. Morphology (the "shape" of a word) is useful in
learning new languages, but not very useful in computational linguistics,
and certainly never to be relied upon. In Malay, "malu" means shy or
ashamed, and nouns are formed from adjectives by the ke-adjective-an
construction. So there was once this American missionary leading a good
old-time revival meeting on the island of Ternate, where he found that the
locals were reluctant about getting up and "testifying" about all the good
things the Lord had been doing for them. "So what's the matter with you
people, anyway? Have your ke-malu-ans got you bound to your seats?"
Unfortunately Rev. Sorbo was not aware that in this particular case usage
trumps morphology, and "kemaluan" means "genitalia." So he was up there
asking those poor natives whether their genitals had them bound to their
benches.
But as far as noun-verb combinations in these languages like Latin that
demand agreement, I cannot agree that such bigrams can be seen as single
entities or single words. They are just made to agree at a superficial
level in order to provide redundancy to the transfer of information, but
internally they still constitute two words and obey my theorem, which of
course states that:
THEOREM: Every word of any coherent sentence ever uttered by man is simply
a semantic link and a syntactic link emanating from the same node.
So Markus links semantically to a special node in the ontology reserved for
that individual or else for Markuses in general, the link type being
"masculine name," and "loves" simply links to the semantic node for that
action, the link being of type 3rd-person-present-tense or whatever it may
be in Latin. Syntactically, Markus links to loves with a link of type
"subject," whereas loves links to nowhere with a link type of "verb of
declarative sentence." Now dress these words up in any kind of morphology
you like as in Greek or Latin, but the same linkages hold.
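To make the theorem concrete, here is a minimal sketch of one possible data structure, not the actual Panlingua encoding: the class name `Word`, its field names, and the ontology node labels "MARKUS" and "LOVE" are hypothetical. Each word carries exactly one semantic link and one syntactic link.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Word:
    surface: str                               # external symbol, morphology and all
    semantic_target: str                       # node in the ontology (label assumed)
    semantic_type: str                         # type of the semantic link
    syntactic_type: str                        # type of the syntactic link
    syntactic_target: Optional["Word"] = None  # None means "links to nowhere"

# "Markus loves": the verb is the root, so its syntactic link points nowhere.
loves = Word("loves", "LOVE", "3rd-person-present-tense",
             "verb of declarative sentence")
markus = Word("Markus", "MARKUS", "masculine name", "subject", loves)

def syntactic_chain(word):
    """Follow syntactic links upward until the root of the sentence."""
    chain = [word.surface]
    while word.syntactic_target is not None:
        word = word.syntactic_target
        chain.append(word.surface)
    return chain

chain = syntactic_chain(markus)
```

Changing `surface` to a Greek or Latin form leaves both links untouched, which is the point of the paragraph above.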
It is important to see that words are not really the external symbols,
which come dressed up in morphologies designed for agreement, ease of
deducing syntactic role, etc., but the links that underpin them. Then, in
the process of understanding, these underpinnings (which are the real
universal grammar that Noam Chomsky was after but could never find) are
determined, the meaning is established, and the external word symbols are
forgotten after a few minutes whereas the memory of the meaning remains in
the mind.
--Chaumont Devin (alias Joe).
I am simply going on some well known facts. I find it incredible that
a handful of refugees from Europe should be able to take on and win
against vastly superior resources both in terms of manpower and
petrodollars.
>
> I am not exactly sure what you mean by a bigram, however it would appear
> that you are getting at the morphological and syntactic agreement required
> by many languages. Morphology (the "shape" of a word) is useful in
> learning new languages, but not very useful in computational linguistics,
> and certainly never to be relied upon. In Malay, "malu" means shy or
> ashamed, and nouns are formed from adjectives by the ke-adjective-an
> construction. So there was once this American missionary leading a good
> old-time revival meeting on the island of Ternate, where he found that the
> locals were reluctant about getting up and "testifying" about all the good
> things the Lord had been doing for them. "So what's the matter with you
> people, anyway? Have your ke-malu-ans got you bound to your seats?"
> Unfortunately Rev. Sorbo was not aware that in this particular case usage
> trumps morphology, and "kemaluan" means "genitalia." So he was up there
> asking those poor natives whether their genitals had them bound to their
> benches.
Suppose I use the term "black hole". This is a body whose
gravitational field is so strong that light cannot escape. For
pedantic purposes I feel I should stress that the definition I am
giving here is different from the way Google and a lot of people refer
to bigrams and n-grams. Google simply refers to pairs of words. In the
definition I am putting forward this pair of words must mean something
distinctive. In fact it is possible to have a language in which "black
hole" is a single word. The bigram can also be used to indicate a hole
in accounts: the treasury has a black hole.
I am stressing a single word. In English radius is one word. In Arabic
it is two; it is like "black hole". Hence my little joke about Salome and
half of Qatar. A supernova can leave as a remnant either a "neutron
star" or a "black hole". In fact "pulsar" is a single word which
means (in effect) neutron star. There are many other
expressions. We would not translate "lock, stock and barrel"; we would
find some sort of equivalent. If you go through a bilingual dictionary
you will find many such expressions. Another example: look up "cat's
cradle" in a French dictionary. The literal translation is "berceau du
chat" but the correct translation is "joue de scie". The translation
of "la guerre des moutons" is "wooly bully". It should be pointed out
that this translation is done by treating the expression as a single
word and finding an expression which is equivalent.
In Google n-grams an attempt is made to find a correspondence between
the two languages. In n-grams as I have defined them there is no
correspondence as the expression is completely different.
>
> But as far as noun-verb combinations in these languages like Latin that
> demand agreement, I cannot agree that such bigrams can be seen as single
> entities or single words. They are just made to agree at a superficial
> level in order to provide redundancy to the transfer of information, but
> internally they still constitute two words and obey my theorem, which of
> course states that:
>
> THEOREM: Every word of any coherent sentence ever uttered by man is simply
> a semantic link and a syntactic link emanating from the same node.
OK, but this semantic link is modified by context. Are we talking about
collapsing stars, the centre of the Galaxy, or are we talking about
the incompetence of Gordon Brown (black hole)? If I say that the
black hole is 3 times the mass of the Sun, I know that it is a
supernova remnant. If it is 16 billion £UK a month I know Gordon Brown
is incompetent. If I say 100 million times the mass of the Sun I know
it is a supermassive Galactic core. There is truth here as well. How
heavy are supernova remnants? How heavy are galactic cores? What is
the state of the British Economy?
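As a crude illustration of context selecting the sense, here is a hand-written rule keyed on the quantity attached to "black hole". It is not LSA, which would learn this from co-occurrence; the function name and the threshold are assumptions chosen only to separate the three examples above.

```python
def black_hole_sense(amount, unit):
    """Pick a sense of "black hole" from the quantity that accompanies it."""
    if unit == "GBP":
        return "hole in the accounts"
    if unit == "solar masses":
        # Illustrative cut-off only; not an astrophysical boundary.
        if amount >= 1e5:
            return "supermassive galactic core"
        return "supernova remnant"
    return "unknown"

senses = [
    black_hole_sense(3, "solar masses"),
    black_hole_sense(100e6, "solar masses"),
    black_hole_sense(16e9, "GBP"),
]
```

The "truth" questions that follow (how heavy are supernova remnants, and so on) are exactly the facts such a rule would have to be derived from.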
>
> So Markus links semantically to a special node in the ontology reserved for
> that individual or else for Markuses in general, the link type being
> "masculine name," and "loves" simply links to the semantic node for that
> action, the link being of type 3rd-person-present-tense or whatever it may
> be in Latin. Syntactically, Markus links to loves with a link of type
> "subject," whereas loves links to nowhere with a link type of "verb of
> declarative sentence." Now dress these words up in any kind of morphology
> you like as in Greek or Latin, but the same linkages hold.
>
> It is important to see that words are not really the external symbols,
> which come dressed up in morphologies designed for agreement, ease of
> deducing syntactic role, etc., but the links that underpin them. Then, in
> the process of understanding, these underpinnings (which are the real
> universal grammar that Noam Chomsky was after but could never find) are
> determined, the meaning is established, and the external word symbols are
> forgotten after a few minutes whereas the memory of the meaning remains in
> the mind.
>
No, an atom is an n-gram; this can be a single word or it can be a
compound word. The meaning of that word is set by LSA which is a
metric of content.
I visualise something quite complicated and involving quite a lot of
effort. I think I can see my way through it, but it is far too much
for a single individual. I hope everyone can see this.
- Ian Parker
I wanted to rename this response to something like:
Subject: Pragmatics, Word Boundaries, and How to Select Meanings
or something like that, but I don't want to destroy this thing we call our
"thread."
Anyhow, here is the real skinny:
You are right that meaning is selected by content, but this process would
seem to have nothing to do with mathematics. At least I finally figured
out how to do this, and my systems really work, and they do not use
mathematics except in calculating a fuzzy-logic value for best-fit
evaluation. My systems are capable of comparing an input sentence against
thousands of previously parsed sentences in a data structure called the
ontology in less than a second in order to come up with a best fuzzy match.
That match is then used in different ways depending on how good it finally
turns out to be. The corpus is theoretically the collection of all of the
sentences the system has ever correctly parsed from the beginning, but I
fudge this and get rid of redundancies in order to make my system faster.
The mapping between natural language and the corpus is kept carefully
exact in both directions, so that the whole corpus can be printed out in
plain English at any time. Internally, though, the corpus is maintained in
a form (Interlinguish encoding) best suited for high-speed searching and
matching, and in such an elegant fashion that all words are exactly six
bytes long and no meta-information is necessary: the number of six-byte
internal words (atoms) is exactly the same as the number of words in the
external texts, except that bigrams are represented by single six-byte
atoms just like any other word.
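To make the fixed-width idea concrete, here is a minimal Python sketch. It
assumes that each atom is simply an integer id packed into six bytes; the
post does not say what the six bytes actually contain, so the lexicon, the
ids, and the function names below are all hypothetical.

```python
ATOM_SIZE = 6  # bytes per atom, as stated in the post

# Hypothetical lexicon: surface form -> integer id. The real id scheme
# used by Interlinguish is not described, so this is an assumption.
LEXICON = {"the": 1, "machine gun": 2, "jammed": 3}
REVERSE = {atom_id: word for word, atom_id in LEXICON.items()}


def encode(words):
    """Pack already-segmented words into fixed-width six-byte atoms."""
    return b"".join(LEXICON[w].to_bytes(ATOM_SIZE, "big") for w in words)


def atom_at(buffer, index):
    """Read atom number `index` directly; no delimiters or lengths needed."""
    start = index * ATOM_SIZE
    return int.from_bytes(buffer[start:start + ATOM_SIZE], "big")


def decode(buffer):
    """Turn the packed atoms back into their external words."""
    count = len(buffer) // ATOM_SIZE
    return [REVERSE[atom_at(buffer, i)] for i in range(count)]


# "machine gun" is one word, so three words become three atoms.
words = ["the", "machine gun", "jammed"]
packed = encode(words)
```

Because every atom has the same width, the n-th word always sits at byte
offset 6*n, which is one plausible reason no meta-information would be
needed, and decoding gives back exactly the external words.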
And after having said this, then it is necessary to clarify the rule by
which bigrams and trigrams are determined, and it is this: When it can be
shown that two or more words do not have any syntactic links between them
because they cannot be fitted into any kind of dependency structure in
which one word is dependent and another word is regent, then that group of
"words" separated by spaces in texts is actually a single word. The most
common ones are two "words" long. More rarely they are three, as in the Latin
names for species, and sometimes even four, but five is so rare that for
most intents and purposes it can safely be ignored. My system handles this
problem by grabbing as many words as possible, starting with a potential
"word" size of about 32 characters and going down from there, and this
works for most things.
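The "grab as many words as possible and go down from there" strategy can
be sketched as a greedy longest-match segmenter. This is only an
illustration under assumptions: the lexicon contents and the function name
are invented here, and the real system may choose its candidates
differently.

```python
MAX_CHARS = 32  # starting window size mentioned in the post

# Hypothetical lexicon of known words, including multi-word entries.
LEXICON = {"machine gun", "the", "soldier", "fired", "a", "machine", "gun"}


def segment(text, lexicon=LEXICON, max_chars=MAX_CHARS):
    """Split text into words, preferring the longest known multi-word entry."""
    tokens = text.lower().split()
    words = []
    i = 0
    while i < len(tokens):
        # Extend the candidate as far as the character window allows.
        j = i + 1
        while j < len(tokens) and len(" ".join(tokens[i:j + 1])) <= max_chars:
            j += 1
        # Shrink from the longest candidate until a lexicon entry is found.
        # A single token is always accepted, even if it is unknown.
        while j > i + 1 and " ".join(tokens[i:j]) not in lexicon:
            j -= 1
        words.append(" ".join(tokens[i:j]))
        i = j
    return words


result = segment("The soldier fired a machine gun")
```

Here "machine gun" comes out as one word because the two-token candidate
is found in the lexicon before the segmenter falls back to "machine"
alone.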
So to give you a hopefully clear example, take the word, "machine gun."
Now we know that all guns are in fact some kind of machine, so it is clear
that in this case "machine" cannot be a modifier, else we would have
"machine cars," "machine boats," "machine robots," "machine airplanes,"
etc. So "machine" is clearly not a dependent of "gun," hence "machine gun"
has to be a single word. "House cat" is a little more difficult, and shows
how compound words can eventually arise from words that are really
dependents of each other. Can "house" be a modifier? You bet. We have,
for example, "house paint" and "home furniture." So what happens in these
interesting cases? Is "house" also an adjective? Not really, and the
answer leads us straight back to my horrible old theorem:
THEOREM: Every word in every coherent sentence is just a semantic link and
a syntactic link emanating from the same node, the external symbols or
trappings, be they audio or morphological or textual or whatever ultimately
having nothing to do with the internal structure.
Armed with a clear understanding of this theorem, it is possible to explain
many heretofore mysterious linguistic phenomena, this matter of the "house"
in "house cat" being one of them. Why? Because here "house" has
conflicting semantic link and syntactic link types. Noun modifiers of this
kind usually have a semantic link type of "adjective," so when they have a
type of "singular noun," which is perfectly possible, they confuse us
because traditional grammar does not allow one word to have more than one
part of speech. But in the original "house cat," the word, "house," has a
semantic role (link type) of "singular noun," and a syntactic role (link
type) of "noun modifier," and this baffles linguists because we have a gut
feeling that this "singular noun" should somehow be an "adjective," so as a
compromise in our dictionaries we may call it a "modifier," which is kinda
close to "adjective" without really committing us because in fact without
my damned theorem we are quite ignorant of what is going on and dare not
say definitely what it is. It is surely being used as an adjective, but it
is surely a noun, and so we don't know what it is until we understand that
words can have "conflicting" semantic and syntactic roles.
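The idea that a word carries one semantic link and one syntactic link,
each with its own type, can be written down as a small record. The sketch
below is an assumption-laden illustration: the field names, the ontology
labels, and the table of "usual" pairings are invented for the example and
are not the actual system's data structures.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Word:
    surface: str            # external symbol, e.g. the printed text
    semantic_node: str      # hypothetical label of the ontology node
    semantic_type: str      # type of the semantic link
    syntactic_type: str     # type of the syntactic link
    regent: Optional[int]   # position of the regent, None if it links nowhere


# Assumed table: the semantic type a given syntactic role usually pairs with.
USUAL_SEMANTIC_TYPE = {"noun modifier": "adjective"}


def roles_conflict(word):
    """True when the semantic type is not the one usually seen in this role."""
    usual = USUAL_SEMANTIC_TYPE.get(word.syntactic_type)
    return usual is not None and usual != word.semantic_type


# "house" in "house paint": a singular noun acting as a noun modifier.
house = Word("house", "HOUSE", "singular noun", "noun modifier", regent=1)
# An ordinary adjective in the same syntactic role, for comparison.
red = Word("red", "RED", "adjective", "noun modifier", regent=1)
```

With the two link types stored separately, "house" does not need a second
part of speech: `roles_conflict(house)` is true and `roles_conflict(red)`
is false, yet both are valid records.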
The same confusion is triggered by auxiliary verbs. In our little heart of
hearts we know that they have to be verbs, and yet they are obviously being
used as adverbs, so once again we find ourselves caught between the old
Devil and the deep and don't know what to do with them, so we politely call
them "auxiliary verbs," and everybody agrees to ignore the problem and move
on leaving it unresolved, when if we had stopped and resolved the problem
we might have discovered my theorem much faster.
So "house cat" has become so engrained as a single entity in our poor minds
that we can't really separate it into two words anymore like we once could,
and "house" is no longer a dependent of "cat" in many minds, so in the
minds where this has happened, "house cat" is a single word, and maps
cleanly to a single semantic node in the ontology.
But "house paint" may continue to be two words, because we are thinking,
"Er, what kind of paint have I been after? Was it boat paint? No, it was
HOUSE paint." So in this case "house" remains a separate word that is
essentially a noun but serves us also as an adjective because its case role
is "noun modifier."
So that is the rule. If one word is really acting as a modifier to another
word, then that other word is its REGENT, there is a dependency
relationship between them, and they are in fact two words. "I'm gonna buy
a can of paint. What kind of paint? HOUSE paint." The word paint can
work by itself or it can be modified by "house." You can't say, "I'm gonna
buy a house" without changing the meaning. But you can say, "I'm gonna buy
a machine" when speaking of machine guns, because "machine" is not a
dependent of "gun" but part of "machine gun."
So we have a rule, but we also have arbitrary choice which may differ from
individual to individual within the same language for a time, as in "house
cat," which some people may think of as a particular kind of cat species
whereas other people might think of it as just a cat that hangs around
houses. So although the internal structure may differ from person to
person, the same old universal system, or universal grammar, is just being
used a little differently from person to person while the universal grammar
itself remains the same.
You wrote:
> No, an atom is an n-gram; this can be a single word or it can be a
> compound word. The meaning of that word is set by LSA
Nada. These things are not set by LSA or ALS or XYZ, but only by usage and
by the completely unpredictable ways in which people come to think of
things. When you can predict the female mind, be it Western or Arabian,
then and only then will I concede that you can predict language by LSA or
XYZ. Language is a living organism and completely unpredictable by
mathematical means. Just get over it and move on.
And now I will address this other issue you brought up known as the idiom.
In a phrase like "kicked the bucket," each word is a real word, and the
dependency relationships all "appear" as usual, but the whole phrase means
something else which cannot be understood unless one is familiar with the
subculture or culture that spawned it. In most languages we have a rule
stating that subtrees in a parse tree map cleanly onto segments of text.
Thus in most cases the machine can simply check its list of idiom subtrees
and make clean substitutions. But for a few languages, such as Koine
Greek, this may not be possible because of the phenomenon I call
"interleaved dependency," in which the rule of subtrees mapping cleanly
onto text segments won't work. I still don't know what to do with
languages like that, and have found no elegant way of handling them using
my system (can't represent them in Interlinguish for processing, and linked
lists are too bulky and slow).
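The rule that subtrees map cleanly onto text segments can be checked
mechanically. The sketch below assumes a sentence is given as a list in
which entry i holds the position of word i's regent (None for the root);
that representation, the function names, and the second example are
assumptions made for illustration, not real Koine Greek data.

```python
def subtree_positions(regents, root):
    """Collect the positions of `root` and every word that depends on it."""
    found = {root}
    changed = True
    while changed:
        changed = False
        for i, regent in enumerate(regents):
            if regent in found and i not in found:
                found.add(i)
                changed = True
    return found


def is_clean_segment(regents, root):
    """True when the subtree under `root` covers one unbroken run of text."""
    positions = subtree_positions(regents, root)
    return max(positions) - min(positions) + 1 == len(positions)


def has_interleaved_dependency(regents):
    """True when at least one subtree does not map onto a clean segment."""
    return any(not is_clean_segment(regents, r) for r in range(len(regents)))


# "John kicked the bucket": John->kicked, kicked is root, the->bucket,
# bucket->kicked. Every subtree is an unbroken run of words.
english = [1, None, 3, 1]

# Hypothetical interleaved order: word 1 depends on word 3, but word 2,
# which sits between them, belongs to a different subtree.
interleaved = [2, 3, None, 2]
```

When `has_interleaved_dependency` is false, an idiom subtree such as "the
bucket" occupies a contiguous span and can be swapped out as a block; when
it is true, no such simple substitution is available.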
And finally PRAGMATICS. A lot of what we say has different meanings
depending on context, as you have pointed out. Linguists (may God bless
their little hearts and souls) call this phenomenon "pragmatics." Many
times we can understand what is intended just by backing up a little. For
example, have you ever started reading a short story, gotten a few lines
down, failed to understand something, and glanced at the top again? Same
thing. We can do this with computers by keeping a log (the corpus) of all
parsed sentences which it is easy for the machine to reexamine in order to
parse the current one. But when this involves looking around at the
flowers and trees or reading the expression on the lady's face, then,
sadly, my machines are out of luck. Nevertheless machines will almost
certainly be able to do all of this and more in the future, and it is
problems like this that keep the field exciting. So who will get there
first--me and my ontology or you and your LSA? We shall see, we shall see.
> which is a metric of content.
Uhuh, voodoo again.
> I visualise something quite complicated and involving quite a lot of
> effort. I think I can see my way through it, but it is far too much
> for a single individual. I hope everyone can see this.
Yes, yes, certainly; but can YOU see how you might just be plain dead
wrong? The true scientist tests his/her hypotheses and dumps them as soon
as they fail, looking for new ones. The amateur hangs onto old "science"
and will not let go of it until the bitter end.
--Chaumont Devin.