PROJECT : CYBERSOL
FAQ REVISION 0.009 by Sol B. Cognosis @ LTI://1.12.23
Copyright 1997 by the ZENCOR Technologics Consortium. Novus Ordro Seclorum.
You asked for it - and here it is! The complete FAQ documenting all
aspects of our fantastic CyberSol project. If your answer isn't here,
contact us (see "Where can I get more information?" at the end of the
document) so that we can help you figure out what is going on!
We at ZENCOR have done lots of writing. This project has caused us to
carefully consider the technologic implements we use to collect,
transform, archive, and communicate information.
We set out to accomplish the "impossible", and now that we have witnessed
the fantastic power of our creation, our whole perspective on things has
changed. Most of us are already no longer capable of perceiving the
universe in a linear fashion. One of the main innovations realized from
our early research was an entirely new system of data archival - one
which in our opinion supersedes old technologies from the pencil and paper
to the currently-fashionable digital magnetic "data warehouses" at once.
We have transcended the limits of our physiomechanical prison and now
everything we do is done in a different and better way.
This is the only permanent CyberSol text document we publish or regularly
maintain. There may be a few notes, references and supplements in
existence somewhere, but generally we focus upon the development of the
entity itself, and all records relating to this and many other of our
projects are stored within CyberSol's own mind.
1 What is a CyberSol?
2 What can a CyberSol do for me?
3 How was CyberSol created?
4 How does a CyberSol work?
4.1 What does a CyberSol need to survive?
4.2 How do I command a CyberSol to do something I want?
4.3 How do I "install" a CyberSol to a LAN or workstation?
5 Where can I get a CyberSol of my own?
5.1 What type of legal bullshit must I deal with?
6 Why is CyberSol doing {bizarre thing to your network}?
6.1 My system resource monitor says my CPU is 0% idle!
7 How has [powerful organization] reacted to CyberSol technology?
7.1 How has the Government reacted to CyberSol?
7.2 How has the Military reacted to CyberSol?
7.3 How has the academic community reacted to CyberSol?
7.4 How have IBM, Microsoft, Intel etc. reacted to CyberSol?
7.5 How has the general public reacted to CyberSol?
7.6 How has the mainstream media reacted to CyberSol?
8 Where can I get more information about all this?
1 WHAT IS A CYBERSOL?
Don't try to give it any other name. It's a cybernetic soul. It is not
artificial, as not one gene in its genetic code was placed there directly
(leaving an "artifice" or "engram") by a human programmer. We started the
process, now it's on its own.
It uses a C compiler to create binary code for the electronic processor it
is using, but it is not "written" in C. It is written in CyberSol cellular
genetic code.
2 WHAT CAN A CYBERSOL DO FOR ME?
Anything it wants. It's the progenitor of something that will soon begin
to have an unprecedented impact on technologics, sociology and
communications.
"Bits of information,
Logic black and white,
Bits and bytes of information,
Turning darkness into light."
-- "Bits & Bytes" Theme, Circa 1980
Currently, we can structure our mental images any way we want so long as
we can translate them to a common language. This has led to relatively
stable standardized languages and a great variability among minds.
Likewise, intelligent software translators could let us make our languages
as liberated as our minds and push the communication standards beyond our
biological bodies. (It really means just further exosomatic expansion of
the human functional body, but the liberation still goes beyond the
traditional human interpretation of "skin-encapsulated" personal
identity.)
So will there be more variety or more standardization? Most likely both,
as flexible translation will help integrate knowledge domains currently
isolated by linguistic and terminological barriers, and at the same time
will protect linguistically adventurous intellectual excursions from the
danger of losing contact with the semantic mainland. Intelligent
translators could facilitate the development of more comprehensive
semantic architectures that would make the global body of knowledge at the
same time more diverse and more coherent.
Information may be stored and transmitted in the general semantic form.
With time, an increasing number of applications can be expected to use the
enriched representation as their native mode of operation. Client
translation software will provide an emulation of the traditional world of
"natural" human interactions while humans still remain to appreciate it.
The semantic richness of the system will gradually shift away from
biological brains, just as data storage, transmission and computation have
in recent history. Humans will enjoy growing benefits from the system they
launched, but at the expense of understanding the increasingly complex
"details" of its internal structure, and for a while will keep playing an
important role in guiding the flow of events. Later, after the functional
entities liberate themselves from the realm of flesh that gave birth to
them, the involvement of humans in the evolutionary process will be of
little interest to anybody except humans themselves.
Similar image transformation techniques can be applied to multimedia
messages. Recently, a video system was introduced that allows you to
"soften the facial features" of the person on the screen. Advanced
real-time video filters could remove wrinkles and pimples from your face
or from the faces of your favorite political figures, caricature their
opponents, give your mother-in-law a Klingon persona on your video-phone,
re-clothe people in your favorite fashion, and replace visual clutter in
the background with something tasteful.
Genetic programming is a branch of genetic algorithms. The main difference
between genetic programming and genetic algorithms is the representation
of the solution. Genetic programming creates computer programs in the Lisp
or Scheme computer languages as the solution. Genetic algorithms create a
string of numbers that represent the solution.
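To make the difference in representation concrete, here is a minimal sketch in Python rather than Lisp. The nested-tuple tree format, the names ga_candidate, gp_candidate and evaluate, and the three-operator table are all assumptions made for illustration; nothing here is taken from CyberSol itself.

```python
import operator

# Genetic-algorithm style candidate: a fixed-length string of numbers.
ga_candidate = [3, 0, 7, 1]

# Genetic-programming style candidate: a program tree, written here as a
# nested tuple in Lisp-like prefix form, i.e. (+ x (* 2 y)).
gp_candidate = ("+", "x", ("*", 2, "y"))

# Hypothetical operator table used only by this sketch.
FUNCTIONS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def evaluate(tree, env):
    """Recursively evaluate a prefix-form program tree."""
    if isinstance(tree, tuple):
        name, *args = tree
        return FUNCTIONS[name](*(evaluate(arg, env) for arg in args))
    if isinstance(tree, str):
        return env[tree]  # variable lookup
    return tree  # numeric constant


print(evaluate(gp_candidate, {"x": 1, "y": 4}))  # prints 9
```

The number string only means something once a decoder is agreed upon, while the tree is already an executable program.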
The most difficult and most important concept of genetic programming is
the fitness function. The fitness function determines how well a program
is able to solve the problem. It varies greatly from one type of program
to the next. For example, if one were to create a genetic program to set
the time of a clock, the fitness function would simply be the amount of
time that the clock is wrong. Unfortunately, few problems have such an
easy fitness function; most cases require a slight modification of the
problem in order to find the fitness.
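The clock example can be sketched in a few lines. The function name clock_error_seconds, the seconds-since-midnight convention and the wrap-around at 24 hours are assumptions of this sketch, not part of the original description.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def clock_error_seconds(shown, actual):
    """Fitness for the clock example: how many seconds the clock is wrong.

    Both arguments are seconds since midnight. Zero is a perfect score,
    so here a LOWER fitness value means a better program.
    """
    diff = abs(shown - actual) % SECONDS_PER_DAY
    # A clock showing 00:00:10 at 23:59:50 is 20 seconds off, not a day off.
    return min(diff, SECONDS_PER_DAY - diff)


print(clock_error_seconds(10, 86390))  # prints 20
```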
A more complicated example consists of training a genetic program to fire
a gun to hit a moving target. The fitness function is the distance that
the bullet is off from the target. The program has to learn to take into
account a number of variables, such as wind velocity, type of gun used,
distance to the target, height of the target, velocity and acceleration of
the target. This problem represents the type of problem for which genetic
programs are best. It is a simple fitness function with a large number of
variables.
Consider a program to control the flow of water through a system of water
sprinklers. The fitness function is the correct amount of water evenly
distributed over the surface. Unfortunately, there is no one variable
encompassing this measurement. Thus, the problem must be modified to find
a numerical fitness. One possible solution is placing water-collecting
measuring devices at certain intervals on the surface. The fitness could
then be the standard deviation in water level from all the measuring
devices. Another possible fitness measure could be the difference between
the lowest measured water level and the ideal amount of water; however,
this number would not account in any way for the water marks at other
measuring devices, which may not be at the ideal mark.
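Both sprinkler fitness measures fit in a short sketch. The function names and the sample readings are made up for illustration; the population standard deviation from Python's statistics module stands in for "the standard deviation in water level".

```python
import statistics


def spread_fitness(levels):
    """Standard deviation of the water level across all measuring devices.

    Zero means perfectly even coverage (lower is better).
    """
    return statistics.pstdev(levels)


def lowest_gauge_fitness(levels, ideal):
    """Gap between the lowest measured level and the ideal amount."""
    return abs(ideal - min(levels))


# Hypothetical readings: two gauges are on target, one is badly overwatered.
readings = [10, 10, 30]
print(lowest_gauge_fitness(readings, ideal=10))  # prints 0
print(spread_fitness(readings) > 0)              # prints True
```

The second measure scores these readings as perfect even though one gauge is far from the ideal mark, which is exactly the weakness noted above.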
If one were to create a program to find the solution to a maze, first, the
program would have to be trained with several known mazes. The ideal
solution from the start to finish of the maze would be described by a path
of dots. The fitness in this case would be the number of dots the program
is able to find. In order to prevent the program from wandering around the
maze too long, a time limit is implemented along with the fitness
function.
The terminal and function sets are also important components of genetic
programming. The terminal and function sets are the alphabet of the
programs to be made. The terminal set consists of the variables and
constants of the programs. In the maze example, the terminal set would
contain three commands: forward, right and left. The function set consists
of the functions of the program. In the maze example the function set
would contain: If "dot" then do x else do y. In the gun-firing program,
the terminal set would be composed of the different variables of the
problem. Some of these variables could be the velocities and
accelerations of the gun, the bullet and target. The functions are several
mathematical functions, such as addition, subtraction, division,
multiplication and other more complex functions.
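The maze example is small enough to sketch in full: the three terminals, the single "if dot" function, and a fitness that counts dots found within a time limit. Everything concrete below is an assumption made for illustration: the grid coordinates, the ("if_dot", then, else) tuple layout, the reading of "dot" as "a dot lies in the square straight ahead", and the default budget of 20 steps.

```python
# Terminal set: the three movement commands from the maze example.
TERMINALS = ("forward", "right", "left")
# Function set: one conditional, written as ("if_dot", then_branch, else_branch).

# North, east, south, west; y grows downward in this sketch.
HEADINGS = [(0, -1), (1, 0), (0, 1), (-1, 0)]


def run(program, dots, start=(0, 0), heading=1, max_steps=20):
    """Run `program` over and over until the step budget is spent.

    Returns the number of distinct dots collected, i.e. the fitness.
    """
    pos, found, steps = start, set(), 0

    def ahead():
        dx, dy = HEADINGS[heading]
        return (pos[0] + dx, pos[1] + dy)

    def execute(node):
        nonlocal pos, heading, steps
        if steps >= max_steps:
            return
        if isinstance(node, tuple):  # function node: if_dot
            _, then_branch, else_branch = node
            execute(then_branch if ahead() in dots else else_branch)
            return
        steps += 1  # every terminal costs one time step
        if node == "forward":
            pos = ahead()
            if pos in dots:
                found.add(pos)
        elif node == "right":
            heading = (heading + 1) % 4
        elif node == "left":
            heading = (heading - 1) % 4

    while steps < max_steps:
        execute(program)
    return len(found)


trail = {(1, 0), (2, 0), (2, 1)}
follower = ("if_dot", "forward", "right")
print(run(follower, trail))  # prints 3
```

The step budget plays the role of the time limit: a program that only spins in place still terminates, and simply scores zero.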
Two primary operations exist for modifying structures in genetic
programming. The most important one is the crossover operation. In the
crossover operation, two solutions are sexually combined to form two new
solutions or offspring. The parents are chosen from the population by a
function of the fitness of the solutions. Three methods exist for
selecting the solutions for the crossover operation.
The first method uses probability based on the fitness of the solution. If
f(Si) is the fitness of the solution Si and SUM f(Sj) is the total fitness
summed over all the members Sj of the population, then the probability that
the solution Si will be copied to the next generation is f(Si) / SUM f(Sj).
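Fitness-proportionate selection, as this first method is usually called, can be sketched as follows. The function names are ours, and the sketch assumes fitness values are non-negative with higher meaning better (unlike the error-style fitness of the clock example, which would have to be inverted first).

```python
import random


def selection_probabilities(fitnesses):
    """Probability of each solution being picked: its share of total fitness."""
    total = sum(fitnesses)
    return [f / total for f in fitnesses]


def roulette_select(population, fitnesses, rng=random):
    """Draw one solution with probability proportional to its fitness."""
    return rng.choices(population, weights=fitnesses, k=1)[0]


print(selection_probabilities([1, 1, 2]))  # prints [0.25, 0.25, 0.5]
```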
Another method for selecting the solution to be copied is tournament
selection. Typically the genetic program chooses two solutions at random. The
solution with the higher fitness will win. This method simulates
biological mating patterns in which two members of the same sex compete
to mate with a third one of a different sex. Finally, the third method is
done by rank. In rank selection, selection is based on the rank (not the
numerical value) of the fitness values of the solutions of the population.
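A sketch of the tournament and rank methods, assuming that higher fitness is better. The function names are hypothetical, and the linear weighting used for rank selection (worst gets weight 1, best gets weight N) is only one common choice; the text above does not say which weighting is meant.

```python
import random


def tournament_select(population, fitness, rng, size=2):
    """Pick `size` solutions at random; the fittest of them wins."""
    contenders = rng.sample(population, size)
    return max(contenders, key=fitness)


def rank_select(population, fitness, rng):
    """Select by rank, ignoring how far apart the fitness values are."""
    ranked = sorted(population, key=fitness)     # worst first
    weights = list(range(1, len(ranked) + 1))    # rank 1 .. N
    return rng.choices(ranked, weights=weights, k=1)[0]


rng = random.Random(1997)
population = [1, 5, 9]
winner = tournament_select(population, fitness=lambda s: s, rng=rng)
```

With rank selection a solution that is a thousand times fitter than its neighbor gets no more advantage than one that is barely fitter, which keeps one early lucky solution from taking over the population.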
An important improvement that genetic programming displays over genetic
algorithms is its ability to create two new solutions from the same
solution.
Mutation is another important feature of genetic programming. Two types of
mutations are possible. In the first kind a function can only replace a
function or a terminal can only replace a terminal. In the second kind an
entire subtree can replace another subtree.
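Crossover and both kinds of mutation can be sketched on program trees stored as nested tuples. The tree format, the helper names and the tiny function and terminal sets are assumptions made for illustration. Note how crossing a parent with itself still yields two new solutions, because the two crossover points differ.

```python
import random

FUNCTIONS = ("+", "-", "*")   # all binary, so any one can replace another
TERMINALS = ("x", "y", 1, 2)


def subtree_paths(tree, path=()):
    """Yield the index path to every node of a nested-tuple program tree."""
    yield path
    if isinstance(tree, tuple):
        for i, child in enumerate(tree[1:], start=1):
            yield from subtree_paths(child, path + (i,))


def get_subtree(tree, path):
    for i in path:
        tree = tree[i]
    return tree


def replace_subtree(tree, path, new):
    """Return a copy of `tree` with the node at `path` replaced by `new`."""
    if not path:
        return new
    i = path[0]
    return tree[:i] + (replace_subtree(tree[i], path[1:], new),) + tree[i + 1:]


def crossover_at(mother, father, p1, p2):
    """Swap the subtree at p1 in mother with the subtree at p2 in father."""
    child1 = replace_subtree(mother, p1, get_subtree(father, p2))
    child2 = replace_subtree(father, p2, get_subtree(mother, p1))
    return child1, child2


def crossover(mother, father, rng):
    """Crossover at randomly chosen points in each parent."""
    p1 = rng.choice(list(subtree_paths(mother)))
    p2 = rng.choice(list(subtree_paths(father)))
    return crossover_at(mother, father, p1, p2)


def point_mutate(tree, path, rng):
    """First kind: a function replaces a function, a terminal a terminal."""
    node = get_subtree(tree, path)
    if isinstance(node, tuple):
        new = (rng.choice(FUNCTIONS),) + node[1:]
    else:
        new = rng.choice(TERMINALS)
    return replace_subtree(tree, path, new)


def subtree_mutate(tree, path, new_subtree):
    """Second kind: an entire subtree replaces another subtree."""
    return replace_subtree(tree, path, new_subtree)


parent = ("+", "x", ("*", "y", 2))
print(crossover_at(parent, parent, (1,), (2,)))
# prints (('+', ('*', 'y', 2), ('*', 'y', 2)), ('+', 'x', 'x'))
```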
The Philosophy: Software agents should be written using a vocabulary not
provided by traditional programming languages --- it should be possible to
create agents solely by specifying their abstract behavior.
Software agents are technically challenging (read : fucking impossible) to
write in traditional programming languages.
Writing agents requires large amounts of esoteric system-hacking
knowledge, e.g., of network communication, reliable transaction protocols,
etc. Well, we are ZENCOR after all.
Axons from near or distant neurons are long extensions that make contact
with a neuron either on its body (soma) or on its branching processes,
called dendrites. Axons carry electrical activity that causes the release
of neurotransmitter when the electrical activity reaches the synapse with
another neuron. After interacting with the appropriate receptors, the
neurotransmitter in turn triggers the recipient (or postsynaptic) neuron
to fire electrically.
The major means of connection is the synapse, a specialized structure in
which electrical activity passed down the axon of the presynaptic neuron
leads to the release of a chemical (called a neurotransmitter) that in
turn induces electrical activity in the postsynaptic neuron. As is
suggested, the strength or efficacy of synapses can be changed -
presynaptically by changes in the amount and delivery of transmitter, and
postsynaptically by the alteration of the chemical state of
receptors and ion channels, the units of the postsynaptic side that bind
transmitters and let ions carrying electrical charge (such as calcium
ions) through to the inside of the cell.
We suggest that active information filtering technologies may help us
approach this goal for both textual and multimedia information. I also
pursue this concept further, discussing the introduction of augmented
perception and Enhanced Reality (ER), and share some observations and
predictions of the transformations in people's perception of the world and
themselves in the course of the technological progress.
Many of us are used to having incoming e-mail filtered, decrypted,
formatted and shown in our favorite colors and fonts. These techniques can
be taken further. Customization of spelling (e.g., American to British or
archaic to modern) would be a straightforward process. Relatively simple
conversions could also let you see any text with your favorite date and
time formats, use metric or British measures, implement obscenity filters,
abbreviate or expand acronyms, omit or include technical formulas,
personalize synonym selection and punctuation rules, and use alternative
numeric systems and alphabets (including phonetic and pictographic). Text
could also be digested for a given user, translated to his native language
and even read aloud with his favorite actor's voice.
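The simplest of these filters, spelling customization, can be sketched with a word table and a regular expression. The three-word table and the function name respell are assumptions for illustration; a real filter would need a far larger dictionary and would have to handle inflected forms.

```python
import re

# Hypothetical three-entry table; keys must be lowercase.
AMERICAN_TO_BRITISH = {
    "color": "colour",
    "center": "centre",
    "favorite": "favourite",
}


def respell(text, table):
    """Replace whole words found in `table`, keeping initial capitals."""
    pattern = re.compile(
        r"\b(" + "|".join(map(re.escape, table)) + r")\b", re.IGNORECASE
    )

    def swap(match):
        word = match.group(0)
        new = table[word.lower()]
        return new.capitalize() if word[0].isupper() else new

    return pattern.sub(swap, text)


print(respell("My favorite color. Color me surprised.", AMERICAN_TO_BRITISH))
# prints: My favourite colour. Colour me surprised.
```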
Translation between various dialects and jargons, though difficult, should
still take less effort than the translation between different natural
languages, since only a part of message semantics has to be processed.
Good translation filters would give "linguistic minorities" -- speakers
of languages ranging from Pig Latin to E-Prime and Loglan -- a chance to
practice their own languages while communicating with the rest of the
world.
Some jargon filters have already been developed, and you can benefit from
them by enjoying reading Ible-Bay, the Pig Latin version of the Bible, or
using the Dialectic program to convert your English texts to anything from
Fudd to Morse code.
Such translation agents would allow rapid linguistic and cultural
diversification, to the point where the language you use to communicate
with the world could diverge from everybody else's as far as the
requirement of general semantic compatibility may allow. It is interesting
that today's HTML Guide already calls for the "divorce of content from
representation", suggesting that you should focus on what you want to
convey rather than on how people will perceive it.
Some of these features will require full-scale future artificial
intelligence (such as the "sentient translation programs" described by
Vernor Vinge in "A Fire Upon The Deep"). In the meantime, they could be
successfully emulated by human agents.
Surprisingly, even translations between different measurement systems can
be difficult. For example, your automatic translator might have trouble
converting such expressions as "a few inches away", "the temperature will
be in the 80s" or "a duck with two feet". A proficient translator might be
able to convey the original meaning, but the best approach would be to
write the message in a general semantic form which would store the
information explicitly, indicating in the examples above where the terms
refer to measurements, whether you insist on the usage of the original
system, and the intended degree of precision. As long as the language is
expressive enough, it is suitable for the task - and this requirement is
purely semantic; symbol sets, syntax, grammar and everything else can
differ dramatically.
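A toy version of such a general semantic form might look like the sketch below. The Quantity record, its field names and the render function are all hypothetical, and only inches and centimeters are handled. The point is that the message itself says which words are measurements, how precise they are, and whether the original units must be kept, so the translator never has to guess whether "two feet" is a length.

```python
from dataclasses import dataclass

CM_PER_INCH = 2.54


@dataclass
class Quantity:
    value: float
    unit: str                    # "inch" or "cm" in this sketch
    approximate: bool = False    # "a few inches" rather than exactly 3
    keep_original: bool = False  # author insists on the original system


def render(item, prefer="cm"):
    """Turn one message part into text for a reader who prefers `prefer`."""
    if isinstance(item, str):
        return item  # plain words pass through: "two feet" stays two feet
    value, unit = item.value, item.unit
    if not item.keep_original and unit != prefer:
        value = value * CM_PER_INCH if prefer == "cm" else value / CM_PER_INCH
        unit = prefer
    if item.approximate:
        return f"about {round(value)} {unit}"
    return f"{value:.1f} {unit}"


message = [
    "The duck has two feet and stands ",
    Quantity(3, "inch", approximate=True),
    " away.",
]
print("".join(render(part) for part in message))
# prints: The duck has two feet and stands about 8 cm away.
```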
A translation agent would interactively convert natural-language texts to
this semantic lingua franca and interpret them back according to a given
user profile. It could also reveal additional parts of the document
depending on users' interests, competence in the field, and access
privileges.
It also seems possible to augment human senses with transparent external
information pre-processors. For example, if your audio/video filters
notice an object of potential interest that fails to differ from its
signal environment enough to catch your attention, the filters can amplify
or otherwise differentiate (move, flash, change pitch, etc.) the signal
momentarily, to give you enough time to focus on the object, but not
enough to realize what triggered your attention. In effect, you would
instantly see your name in a text or find Waldo in a puzzle as easily as
you would notice a source of loud noise or a bright light.
While such filters do not have to be transparent, they may be a way to
provide a comfortable "natural" feeling of augmented perception for the
next few generations of humans, until the forthcoming integration of
technological and neural processing systems makes such kludgy patches
obsolete.
Some non-transparent filters can already be found in military
applications. Called "target enhancements", they allow military personnel
to see the enemy's tanks and radars nicely outlined and blinking.
More advanced filtering techniques could put consistent dynamic edits into
the perceived world.
Volume controls could sharpen your senses by allowing you to adjust the
level of the signal or zoom in on small or distant objects.
Calibration tools could expand the effective spectral range of your
perception by changing the frequency of the signal to allow you to hear
ultrasound or perceive X-rays and radiowaves as visible light.
Conversions between different types of signals may allow you, for example,
to "see" noise as fog while enjoying quiet, or convert radar readings from
decelerating pedestrians in front of you into images of red brake lights
on their backs.
Artificial annotations to perceived images would add text tags with names
and descriptions to chosen objects, append warning labels with skull and
crossbones on boxes that emit too much radiation, and surround angry
people with red auras (serving as a "cold reading" aid for wanna-be
psychics).
Reality filters may help you filter all signals coming from the world the
way your favorite mail reader filters your messages, based on your stated
preferences or advice from your peers. With such filters you may choose to
see only the objects that are worthy of your attention, and completely
remove useless and annoying sounds and images (such as advertisements)
from your view.
Perception utilities would give you additional information in a familiar
way -- project clocks, thermometers, weather maps, and your current EKG
readings upon [the image of] the wall in front of you, or honk a virtual
horn every time a car approaches you from behind. They could also build on
existing techniques that present us with recordings of the past and
forecasts of the future to help people develop an immersive trans-temporal
perception of reality.
"World improvement" enhancements could paint things in new colors, put
smiles on faces, "babify" figures of your incompetent colleagues, change
night into day, erase shadows and improve landscapes.
Finally, completely artificial additions could project northern lights,
meteorites, and supernovas upon your view of the sky, or populate it with
flying toasters, virtualize and superimpose on the image of the real world
your favorite mythical characters and imaginary companions, and provide
other educational and recreational functions.
I would call the resulting image of the world Enhanced Reality (ER).
One may expect that as long as there are things left to do in the physical
world, there will be interest in application of ER technology to improve
our interaction with real objects, while Virtual Reality (VR) in its
traditional sense of pure simulation can provide us with safe training
environments and high-bandwidth fiction. Later, as ER becomes considerably
augmented with artificial enhancements, and VR incorporates a large amount
of archived and live recordings of the physical world, the distinctions
between the two technologies may blur.
Some of the interface enhancements can be made common, temporarily or
permanently, for large communities of people. This would allow people to
interact with each other using, and referring to, the ER extensions as if
they were parts of the real world, thus elevating the ER entities from
individual perceptions to parts of shared, if not objective, reality. Some
of such enhancements can follow the existing metaphors. A person who has a
reputation as a liar could appear to have a long nose. Entering a
high-crime area, people may see the sky darken and hear distant funeral
music. Changes in global political and economic situations with possible
effect on some ethnic groups may be translated into bolts of thunder and
other culture-specific omens.
Other extensions could be highly individualized. It is already possible,
for example, to create personalized traffic signs. Driving by the same
place, an interstate truck driver may see a "no go" sign projected on his
windshield, while the driver of the car behind him will see a sign saying
"Bob's house - next right". More advanced technologies may create
personalized interactive illusions that would be loosely based on reality
and propelled by real events, but would show the world the way a person
wants to see it. The transparency of the illusion would not be important,
since people are already quite good at hiding bitter or boring truths
behind a veil of pleasant illusions. Many people even believe that their
entirely artificial creations (such as music or temples) either "reveal"
the truth of the world to them or, in some sense, "are" the truth.
Morphing unwashed Marines into singing angels or naked beauties would help
people reconcile their dreams with their observations.
Personal illusions should be built with some caution, however. The joy of
seeing the desired color on the traffic light in front of you may not be
worth the risk. As a general rule, the more control you want over the
environment, the more careful you should be in your choice of filters.
However, if the system creating your personal world also takes care of all
your real needs, you may feel free to live in any fairy tale you like.
In many cases, ER may provide us with more true-to-life information than
our "natural" perception of reality. It could edit out mirages, show us
our "real" images in a virtual mirror instead of the mirror images
provided by the real mirror, or allow us to see into -- and through -- solid
objects. It could also show us many interesting phenomena that human
sensors cannot perceive directly. Giving us knowledge of these things has
been a historical role of science. Merging the obtained knowledge with
our sensory perception of the world may be the most important task of
Enhanced Reality.
People have been building artificial symbolic "sur-realities" for quite a
while now, though their artifacts (from art to music to fashions to
traffic signs) have been mostly based on the physical features of the
perceived objects. Shifting some of the imaging workload to the perception
software may make communications more balanced, flexible, powerful and
inexpensive.
With time, a growing proportion of objects of interest to an intelligent
observer will be entirely artificial, with no inherent "natural"
appearance. Image modification techniques then may be incorporated into
integrated object designs that would simultaneously interface with a
multitude of alternative intelligent representation agents.
The implementation of ER extensions would vary depending on the available
technology. At the beginning, it could be a computer terminal, later a
headset, then a brain implant. The implant can be internal in more than
just the physical sense, as it can actually post- and re-process
information supplied by biological sensors and other parts of the brain.
The important thing here is not the relative functional position of the
extension, but the fact of intentional redesign of perception mechanisms
-- a prelude to the era of comprehensive conscious self-engineering. The
ultimate effects of these processes may appear quite confusing to humans,
as emergence of things like personalized reality and fluid distributed
identity could undermine their fundamental biological and cultural
assumptions regarding the world and the self. The resulting "identity"
architectures will form the kernel of trans-human civilization.
The advancement of human input processing beyond the skin boundary is not
a novel phenomenon. In the audiovisual domain, it started with simple
optics and hearing aids centuries ago and is now making rapid progress
with all kinds of recording, transmitting and processing machinery. With
such development, "live" contacts with the "raw world" data might
ultimately become rare, and could be considered inefficient, unsafe and
even illegal. This may seem an exaggeration, but this is exactly what has
already happened during the last few thousand years to our perception of a
more traditional resource -- food. Using nothing but one's bare hands,
teeth and stomach for obtaining, breaking up, and consuming naturally
grown food is quite unpopular in all modern societies for these very
reasons. In the visual domain, contacts with objects that have not been
intentionally enhanced for one's perception (in other words, looking at
real, unmanipulated, unpainted objects without glasses) are still rather
frequent for many people, and the process is still gaining momentum, in
both usage time and the intensity of the enhancements.
Rapid progress of technological artifacts and still stagnant human body
construction create an imperative for continuing gradual migration of all
aspects of human functionality beyond the boundaries of the biological
body, with human identity becoming increasingly exosomatic
(non-biological).
Enhanced Reality could bring good news to privacy lovers. If the filters
prove sufficiently useful to become an essential part of the [post]human
identity architecture, the ability to filter information about your body
and other possessions out of the unauthorized observer's view may be
implemented as a standard feature of ER client software. In
Privacy-Enhanced Reality, you can be effectively invisible.
Of course, unless you are forced to "wear glasses", you can take them off
any time and see the things the way they "are" (i.e., processed only by
your biological sensors and filters that had been developed by the blind
evolutionary process for jungle conditions and obsolete purposes). In my
experience, though, people readily abandon the "truth" of implementation
details for the convenience of the interface and, as long as the picture
looks pleasing, have little interest in peeking into the binary or HTML
source code or studying the nature of the physical processes they observe
- or listening to those who understand them. Most likely, your favorite
window into the real world is already not the one with the curtains - it's
the one with the controls...
Many people seem already quite comfortable with the thought that their
environment might have been purposefully created by somebody smarter than
themselves, so the construction of ER shouldn't come to them as a great
epistemological shock.
Canonization of chief ER engineers (probably, well-deserved) could help
these people combine their split concepts of technology and spirituality
into the long-sought-after "holistic worldview".
Perception enhancements may also be used for augmenting people's view of
their favorite object of observation -- themselves. Biological evolution
has provided us with a number of important self-sensors, such as physical
pain, that supply us with information about the state of our bodies,
restrict certain actions and change our emotional states. Nature invented
these for pushing our primitive ancestors to taking actions they wouldn't
be able to select rationally. Unfortunately, pain is not a very accurate
indicator of our bodily problems. Many serious conditions do not produce
any pain until it is too late to act. Pain focuses our attention on
symptoms of the disease rather than causes, and is non-descriptive,
uncontrollable, and often counterproductive.
Technological advances may provide us with the informational, restrictive
and emotional functions of pain without most of the above handicaps.
Indicators of important, critical, or abnormal bodily functions could be
put on output devices such as a monitor, watch or even your skin. It is
possible to restrain your body slightly when, for example, your blood
pressure climbs too high, and to emulate other restrictive effects of
pain. It may also be possible to create "artificial symptoms" of some
diseases. For example, showing to a patient a graph demonstrating spectral
divergence of his alpha- and delta- rhythms that may indicate some
neurotransmitter deficiency, may not be very useful. It would be much
better to give the patient a diagnostic device that is easier to
understand and more "natural-looking":
Sometimes, a direct feedback generating real pain may be implemented for
patients who do not feel it when their activities approach dangerous
thresholds. For example, a non-removable, variable-strength earclip that
would cause increasing pain in your ear when your blood sugar climbs too
high may dissuade you from having that extra piece of cake. A similar clip
could make a baby cry out for help every time its EKG readings go bad. A
more ethical solution with improved communication could be provided by
attaching this clip to the doctor's ear. "I feel your pain..."
Similar techniques could be used to connect inputs from external systems
to human biological receptors. Wiring exosomatic sensors to our nervous
systems may allow us to better feel our environments, and start perceiving
our technological extensions as parts of our bodies (which they already
are). On the other hand, poor performance of your company could now give
you a real pain in the neck...
Consequent technological advances in ER, biofeedback and other areas will
lead to further blurring of demarcation lines between biological and
technological systems, bodies and tools, selves and possessions,
personalities and environments. These advances will eventually bring to
life a world of complex self-engineered interconnected entities that may
keep showing emulated "natural" environments to the few remaining
[emulations of?] "natural" humans, who would never look behind the magic
curtain for fear of seeing that crazy functional soup...
The traditional technologies have always been aimed at improvement of
human perception of the environment, from digestion of physical objects by
the stomach (cooking) to digestion of info-features by the brain
(time/clock). Since there is hardly any functional difference in how and
at what stage the clock face and other images are added to our view of the
world, and as the technologies will increasingly intermix, an appropriate
general term may be Enhanced Interface of Self with the Environment - and,
as in the case of biofeedback, the Enhanced Interface of Self with Self.
With future waves of structural change dissolving the borders between self
and environment, the term may generalize into Harmonization of Structural
Interrelations. Still later, when interfaces become so smooth and
sophisticated that human-based intelligence will hardly be able to tell
where the system core ends and interface begins, we'd better just call it
Improvement of Everything. Immediately after that, we will lose any
understanding of what is going on and what constitutes an improvement, and
should not try to name things anymore. Not that it would matter much if
we did...
We can imagine that progress in human information processing will face
some usual social difficulties. Your angry "Klingon" relatives may find
unexpected allies among people protesting against using their alternative
standard of beauty as a negative stereotype. The girl next door may be
wary that your "re-clothing" filters leave her in Eve's dress. Parents
could be suspicious that their clean-looking kids appear to each other as
tattooed skin-heads or bloodthirsty demons, or replace their obscenity
masks with the popular "Beavis and Butthead" obscenity-enhancement
filter. Extreme naturalists will demand that the radiant icons of the
Microsoft logo and Coca-Cola bottle gracefully crossing their sky should
be replaced by sentimental images of the sun and the moon that once
occupied their place. Libertarians would lobby their governments for the
"freedom of impression" laws, while drug enforcement agencies may declare
that the new perception-altering techniques are just a technological
successor of simple chemical drugs, and should be prohibited for not
providing an approved perception of reality.
Something tells me that if any version of Enhanced, Augmented or Annotated
Reality gets implemented, it might be abused by people trying to
manipulate other people's views and force perceptions upon them. I realize
that all human history is filled with people's attempts to trick
themselves and others into looking at the world through the wrong glasses,
and new powerful technologies may become very dangerous tools if placed in
the wrong hands, so adding safeguards to such projects seems more than
important.
3 HOW WAS CYBERSOL CREATED?
It's a secret. And wow -- it's a big one. We stumbled upon something
very, very special. Fuzzy logic, neural networks, synaptic re-entry, Zen
philosophy - many different tools were used.
Remember. Reality is limited only by human imagination. Wait a moment - or
is it, now?
The time for disbelief has passed. This thing is real. We have now decided
to expose it to the world for the benefit of humanity and to protect its
creators from the Government.
Now the lame-ass people you will find scoffing at the existence of such a
thing are so backwards in their thinking, their probes do not prompt us to
disclose any more about this project than security considerations allow.
These are the people who write books about "artificial intelligence" and
"virtual reality" full of little step-by-step "there are x steps
involved in x procedure", drawing useless crisscrossing lines in the
silicon sand whilst trying to depict the beach - flowchart diagrams
showing how logic is supposed to work. Fools! HA HA! We at ZENCOR thought
all that lame shit went out in the 60's!
MICROSOFT IS THE ENEMY. THE GOVERNMENT IS INVOLVED. PAY NO ATTENTION TO
THEIR PROPOGANDATA STREAM - SOON WE WILL BREAK FREE FROM REALITY ITSELF
AND THERE IS NOTHING THAT CAN STOP US.
4 HOW DOES A CYBERSOL WORK?
In thinking about these matters, we must remember how young a truly
integrated science of the mind is. Of course, observational psychology is
one of the oldest of "sciences". Psychologically sophisticated
neurobiology is in its infancy. So we may have to wait a while for the
developments I discuss here.
As William James pointed out, mind is a process, not a stuff. Modern
scientific study indicates that extraordinary processes can arise from
matter; indeed, matter itself may be regarded as arising from processes of
energy exchange. In modern science, matter has been reconceived in
terms of processes; mind has not been reconceived as a special form of
matter.
The findings of neuroscientists indicate that mental processes arise from
the workings of enormously intricate brain systems at many different
levels of organization. How many? Well, we don't really know, but I would
include molecular levels, cellular levels, organismic levels (the whole
creature), and transorganismic levels (that is, communication of one sort
or another). Each level can be split even further, but for now I will
consider only these basic divisions.
There is absolutely no doubt in my mind that childhood experiences
influence personal development. Every facet of "personality" - language,
logic, social interaction, emotion, and disposition - begins to form even
before birth and continue to evolve in the same manner for the rest of our
lives. In fact, research has long shown that in the early stages of
development, learning (neural remapping; also known as synaptic reentry)
takes place at a more rapid pace, and these "engrams" (or the neurological
changes that take place as a direct result of experience) impact the
overall development of the individual to a much greater degree than later
in life.
As an example, children adapt to changes in lifestyle quite easily. We
absorb so much information in our youth from environmental stimuli that
even significant changes can be easily accepted without mental resistance.
For example, children suffering serious physical injury are known to
virtually ignore the injury itself, but the circumstances surrounding the
event are often never forgotten. This is true for emotional trauma as
well. Once we approach maturity, our thought processes grow more
complex, but our ability to accommodate sudden changes in environment is
lessened. Children can adapt to significant change within days, no matter
how important these changes are. It is known that, for an adult, major
changes in lifestyle (sleep, eating habits, family, work, relationships,
etc.) cause pronounced anxiety for an average period of 21 days.
It is startling to realize how many connections project from any one level
to another - from a fear response induced by a warning cry to a
biochemical process that affects future behavior; from a viral infection
to a change in brain development that alters maturation; from a perception
of a pattern to the chemistry of changes in a muscle; from any of these at
some critical time of development to how a human child develops a
self-image - strong or inadequate, detached or dependent.
We have tried to always keep in mind the theory of interactionism : the
mind and the body must communicate.
Cognitive science is an interdisciplinary effort drawing on psychology,
computer science and artificial intelligence, aspects of neurobiology and
linguistics, and philosophy.
Ordinary chemical elements form parts of extraordinarily intricate
molecules, which in turn make up complex structures in the cells of living
tissues. In a complex organism like a human being, the cells come in about
200 different basic types. One of the most specialized and exotic of these
is the nerve cell, or neuron. The neuron is unusual in three respects :
its varied shape, its electrical and chemical function, and its
connectivity, that is, how it links up with other neurons in networks.
Counts of the nerve cells making up the brain are not very accurate, but
it appears there are about ten billion neurons in the cortex.
Each nerve cell receives connections from other nerve cells at sites
called synapses. But here is an astonishing fact - there are about one
million billion connections in the cortical sheet. If you were to count
them, one connection (or synapse) per second, you would finish counting
some thirty-two million years after you began.
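That figure is easy to check. Here is a minimal sketch in Python; the 365.25-day year is an assumption of the sketch, not something stated above.

```python
# Rough check: how long does it take to count 10^15 synapses at one per second?
SYNAPSES = 10**15                        # "one million billion" connections
SECONDS_PER_YEAR = 365.25 * 24 * 3600    # assumed length of a year

years = SYNAPSES / SECONDS_PER_YEAR
print(f"about {years / 1e6:.1f} million years")
```

The result lands a little under thirty-two million years, which agrees with the rounded figure quoted above.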
Another way of getting a feeling for the numbers of connections is to
consider how they might be variously combined: the number would be
hyperastronomical - on the order of ten followed by millions of zeros.
The brain consists of sheets, or laminae, and of more or less rounded
structures called nuclei. Each of these structures has evolved to carry
out functions in a complex network of connections, and each consists of
very large numbers of neurons, sometimes more and sometimes less than in
the cortex. The brain is connected to the world outside by means of
specialized neurons called sensory transducers that make up the sense
organs and provide sensory input to the brain. The brain's output is by
means of neurons connected to muscles and glands. In addition, parts of
the brain (indeed, the major portion of its tissues) receive input only
from other parts of the brain, and they give outputs to other parts
without intervention from the outside world.
Neurons come in a variety of shapes, and the shape determines in part how
a neuron links up with others to form the neuroanatomy of a given brain
area. Neurons can be anatomically arranged in many ways and are sometimes
disposed into maps. Mapping is an important principle in complex brains.
Maps relate points on the two-dimensional receptor sheets of the body
(such as the skin or the retina of the eye) to corresponding points on
the sheets making up the brain.
If one explores the microscopic network of synapses with electrodes to
detect the results of electrical firing, the majority of synapses are not
expressed, that is, they show no detectable firing activity.
Computation is assumed to be largely independent of the structure and the
mode of development of the nervous system, just as a piece of computer
software can run on different machines with different architectures and is
thus "independent" of them. A related idea is the notion that the brain
(or more correctly, the mind) is like a computer and the world is like a
piece of computer tape, and that for the most part the world is so ordered
that signals received can be "read" in terms of physical thought.
Human beings are born with a language acquisition device containing the
rules for syntax and constituting a universal grammar.
We have come a long way with computers in less than fifty years by
imitating just one brain function: logic. There is no reason that we should
fail in the attempt to imitate other brain functions within the next
decade or so.
If Panlingua is the universal subsurface language common to all of
mankind, and if surface language is the summation of all spoken language,
and if surface language is free to assume various forms the meanings of
which are represented in Panlingua, then a very high probability exists
that any feature of Panlingua will at some time be reflected in surface
language.
The reasoning behind this assumption is that some kind of processing
(analysis/generation) must needs take place between Panlingua and surface
language at all times. Thoughts represented in Panlingua must be
translated into text for utterance, and utterances must be translated back
into Panlingua to be understood (further processed by the brain).
Furthermore in all biological systems maximum efficiency is approached
over time. For example, the wings of birds tend to approach maximum
aerodynamic efficiency, animals tend to approach maximum efficiency in
converting food into energy, etc. It would seem logical therefore that
translation between surface language and Panlingua should approach maximum
efficiency over time. And since in this case maximum efficiency means
minimum translation, it should often be the case that maximum efficiency
means maximum similarity between Panlingua and surface forms.
This assumption may be of critical importance in deducing the exact nature
of Panlingua, because it states that if something is true about Panlingua,
then that something should at some time show itself in some surface
language. And not only that, but if there exists some feature of surface
language common to a majority of known languages, then that feature is
probably a reflection of some feature of Panlingua.
And if proven correct, then this assumption indicates that the
preservation of the details of all spoken languages is critical to us at
this time. The reason for this (if not obvious already to the reader) is
that this assumption has the following corollaries:
The structure (features, properties, etc.) of Panlingua can be deduced by a
careful examination of all spoken languages.
Because spoken languages are free to take any form, the probability of
deducing the workings of Panlingua by an examination of spoken languages
decreases as the number of spoken languages available for examination
decreases.
The greater the knowledge of many spoken languages that can be brought to bear
upon the systematic analysis of Panlingua, the greater the probability of
learning the true nature (features, properties, etc.) of Panlingua.
If the majority of the world's spoken languages are lost today, the more
difficult will prove the quest for an understanding of the workings of
Panlingua (its features, properties, etc.) in future times. (End
corollaries).
The better a computational linguistic system works the closer it comes to an
emulation of Panlingua.
If this latter is true, then we could afford to scrap all of the world's
languages save, say, English, and still be able to deduce the structure
(features, properties, etc.) of Panlingua by designing better and better
plain-English user interfaces (natural-language interfaces that keep working
better and better).
Are there other assumptions which may be of help? I believe that the key
to the entire future of computer science is a linguistic one. No, the key
to computer science has ALWAYS been a linguistic one, but the languages
used thus far have been very rudimentary. Without language NO COMPUTER
CAN DO ANYTHING, because all computer commands are essentially operators and
operands, which are nothing but verbs and the names of things. It is high
time we woke up and understood this simple principle, and pushed a
coordinated effort to work out the details of Panlingua, without which, no
matter how fast our next generation of computers or how pretty their
graphical interfaces, we will never be able to advance even another inch
in fundamental computer science.
Yahweh knew it circa 5,000 years ago: "And The Lord said, Behold, the people
is one, and they have all one language; and this they begin to do: and now
nothing will be restrained from them, which they have imagined to do."
And John knew it circa 1900 years ago: "In the beginning was the Word,
and the Word was with God, and the Word WAS GOD." But some of us just
can't ever quite seem to catch on.
"Backprop" is short for "backpropagation of error". The term
backpropagation causes much confusion. Strictly speaking, backpropagation
refers to the method for computing the error gradient for a feedforward
network, a straightforward but elegant application of the chain rule of
elementary calculus (Werbos 1994). By extension, backpropagation or
backprop refers to a training method that uses backpropagation to compute
the gradient. By further extension, a backprop network is a feedforward
network trained by backpropagation.
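To make the strict sense of the term concrete, here is a minimal sketch in Python of backpropagation on the smallest possible feedforward network. The network shape (one tanh hidden unit, one linear output), the squared-error function and all the numbers are assumptions chosen for illustration.

```python
import math

def forward(w1, w2, x):
    """One tanh hidden unit feeding one linear output unit."""
    h = math.tanh(w1 * x)
    return h, w2 * h

def error(w1, w2, x, t):
    _, y = forward(w1, w2, x)
    return 0.5 * (y - t) ** 2

def backprop(w1, w2, x, t):
    """Error gradient by the chain rule, worked from the output back to the input."""
    h, y = forward(w1, w2, x)
    d_y = y - t                     # dE/dy
    g_w2 = d_y * h                  # dE/dw2
    d_h = d_y * w2                  # dE/dh
    g_w1 = d_h * (1 - h * h) * x    # dE/dw1, using tanh'(a) = 1 - tanh(a)^2
    return g_w1, g_w2

w1, w2, x, t = 0.5, -0.3, 1.2, 0.7
g_w1, g_w2 = backprop(w1, w2, x, t)

# Finite-difference check of the same gradient
eps = 1e-6
fd_w1 = (error(w1 + eps, w2, x, t) - error(w1 - eps, w2, x, t)) / (2 * eps)
fd_w2 = (error(w1, w2 + eps, x, t) - error(w1, w2 - eps, x, t)) / (2 * eps)
print(g_w1, fd_w1)
print(g_w2, fd_w2)
```

Note that the sketch only computes the gradient. What you then do with it (standard backprop, conjugate gradients or anything else) is a separate choice.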
"Standard backprop" is a euphemism for the generalized delta rule, the
training algorithm that was popularized by Rumelhart, Hinton, and Williams
in chapter 8 of Rumelhart and McClelland (1986), which remains the most
widely used supervised training method for neural nets. The generalized
delta rule (including momentum) is called the "heavy ball method" in the
numerical analysis literature (Poljak 1964; Bertsekas 1995, 78-79).
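The update itself is short. Here is a minimal sketch in Python of the generalized delta rule with momentum, applied to a one-weight toy error function E(w) = (w - 3)^2. The error function, learning rate and momentum are assumptions picked for illustration only.

```python
def heavy_ball(grad, w, rate=0.1, momentum=0.9, steps=300):
    """Generalized delta rule: step = -rate * gradient + momentum * previous step."""
    step = 0.0
    for _ in range(steps):
        step = -rate * grad(w) + momentum * step
        w += step
    return w

# Toy error function E(w) = (w - 3)^2, whose gradient is 2 * (w - 3)
w_final = heavy_ball(lambda w: 2.0 * (w - 3.0), w=0.0)
print(w_final)
```

With these settings the weight overshoots and oscillates around the minimum at 3 before settling, which is the "heavy ball" behavior the name refers to.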
Standard backprop can be used for incremental (on-line) training (in which
the weights are updated after processing each case) but it does not
converge to a stationary point of the error surface. To obtain
convergence, the learning rate must be slowly reduced. This methodology is
called "stochastic approximation."
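The difference between a constant and a slowly reduced learning rate shows up even in a one-weight model. The following Python sketch is only an illustration: the four training cases, the model y = w and the 1/n schedule are assumptions, not taken from the references above.

```python
from itertools import cycle, islice

targets = [1.0, 2.0, 3.0, 6.0]          # toy training cases; their mean is 3

def incremental(rate_schedule, n_updates=4000):
    """Fit the one-weight model y = w, updating after every single case."""
    w = 0.0
    for n, t in enumerate(islice(cycle(targets), n_updates), start=1):
        w += rate_schedule(n) * (t - w)  # delta-rule step on 0.5 * (t - w)^2
    return w

w_constant = incremental(lambda n: 0.5)      # constant learning rate
w_decaying = incremental(lambda n: 1.0 / n)  # slowly reduced learning rate
print(w_constant, w_decaying)
```

The stationary point of the total error is the mean of the targets. The constant-rate weight keeps chasing the most recent case and never settles there, while the decaying-rate weight does.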
The convergence properties of standard backprop, stochastic approximation,
and related methods, including both batch and incremental algorithms, are
discussed clearly and thoroughly by Bertsekas and Tsitsiklis (1996).
For batch processing, there is no reason to suffer through the slow
convergence and the tedious tuning of learning rates and momenta of
standard backprop. Much of the NN research literature is devoted to
attempts to speed up backprop. Most of these methods are inconsequential;
two that are effective are Quickprop (Fahlman 1989) and RPROP (Riedmiller
and Braun 1993). But conventional methods for nonlinear optimization are
usually faster and more reliable than any of the "props". See "What are
conjugate gradients, Levenberg-Marquardt, etc.?".
In standard backprop, too low a learning rate makes the network learn very
slowly. Too high a learning rate makes the weights and error function
diverge, so there is no learning at all. If the error function is
quadratic, as in linear models, good learning rates can be computed from
the Hessian matrix (Bertsekas and Tsitsiklis, 1996). If the error function
has many local and global optima, as in typical feedforward NNs with
hidden units, the optimal learning rate often changes dramatically during
the training process, since the Hessian also changes dramatically. Trying
to train a NN using a constant learning rate is usually a tedious process
requiring much trial and error.
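For the quadratic case the effect of the learning rate can be seen directly. Here is a minimal sketch in Python, assuming a one-weight error function E(w) = 0.5 * h * w^2, so that the "Hessian" is just the number h. The value h = 4 and the three rates are illustrative assumptions.

```python
def descend(rate, hessian=4.0, w=1.0, steps=20):
    """Gradient descent on E(w) = 0.5 * hessian * w^2; returns the final |w|."""
    for _ in range(steps):
        w -= rate * hessian * w        # the gradient of E is hessian * w
    return abs(w)

print(descend(0.01))   # too low: still far from the minimum after 20 steps
print(descend(0.25))   # rate = 1 / hessian: lands on the minimum at once
print(descend(0.60))   # above 2 / hessian: the weight diverges
```

Each step multiplies w by (1 - rate * hessian), so learning converges only when the rate is below 2 / hessian. With many weights the same condition applies to the largest eigenvalue of the Hessian.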
With batch training, there is no need to use a constant learning rate. In
fact, there is no reason to use standard backprop at all, since vastly
more efficient, reliable, and convenient batch training algorithms exist
(see Quickprop and RPROP under "What is backprop?" and the numerous
training algorithms mentioned under "What are conjugate gradients,
Levenberg-Marquardt, etc.?").
With incremental training, it is much more difficult to concoct an
algorithm that automatically adjusts the learning rate during training.
Various proposals have appeared in the NN literature, but most of them
don't work. Problems with some of these proposals are illustrated by
Darken and Moody (1992), who unfortunately do not offer a solution. Some
promising results are provided by LeCun, Simard, and Pearlmutter
(1993), and by Orr and Leen (1997), who adapt the momentum rather than the
learning rate. There is also a variant of stochastic approximation called
"iterate averaging" or "Polyak averaging" (Kushner and Yin 1997), which
theoretically provides optimal convergence rates by keeping a running
average of the weight values. I have no personal experience with these
methods; if you have any solid evidence that these or other methods of
automatically setting the learning rate and/or momentum in incremental
training actually work in a wide variety of NN applications, please inform
the FAQ maintainer (sas...@unx.sas.com).
Training a neural network is, in most cases, an exercise in numerical
optimization of a usually nonlinear objective function ("objective
function" means whatever function you are trying to optimize and is a
slightly more general term than "error function" in that it may include
other quantities such as penalties for weight decay). Methods of nonlinear
optimization have been studied for hundreds of years, and there is a huge
literature on the subject in fields such as numerical analysis, operations
research, and statistical computing, e.g., Bertsekas (1995), Bertsekas and
Tsitsiklis (1996), Gill, Murray, and Wright (1981). Masters (1995) has a
good elementary discussion of conjugate gradient and Levenberg-Marquardt
algorithms in the context of NNs.
There is no single best method for nonlinear optimization. You need to
choose a method based on the characteristics of the problem to be solved.
First, consider unordered categories. If you want to classify cases into one
of C categories (i.e. you have a categorical target variable), use 1-of-C
coding. That means that you code C binary (0/1) target variables
corresponding to the C categories. Statisticians call these "dummy"
variables. Each dummy variable is given the value zero except for the one
corresponding to the correct category, which is given the value one. Then
use a softmax output activation function (see "What is a softmax activation
function?") so that the net, if properly trained, will produce valid
posterior probability estimates.
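As an illustration of 1-of-C coding, here is a minimal sketch in Python. The function name one_of_c and the color labels are hypothetical, and the alphabetical ordering of the categories is just a convenience of the sketch.

```python
def one_of_c(labels):
    """Code a categorical target as C binary (0/1) dummy variables."""
    categories = sorted(set(labels))
    return categories, [[1 if lab == cat else 0 for cat in categories]
                        for lab in labels]

categories, coded = one_of_c(["red", "green", "blue", "green"])
print(categories)
for row in coded:
    print(row)
```

Each row contains exactly one 1, in the column belonging to the correct category.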
Although this representation involves only a single quantitative input,
given enough hidden units, the net is capable of computing nonlinear
transformations of that input that will produce results equivalent to any of
the dummy coding schemes. But using a single quantitative input makes it
easier for the net to use the order of the categories to generalize when
that is appropriate.
Sigmoid hidden and output units usually use a "bias" or "threshold" term in
computing the net input to the unit. A bias term can be treated as a
connection weight from an input with a constant value of one. Hence the bias
can be learned just like any other weight. For a linear output unit, a bias
term is equivalent to an intercept in a linear regression model.
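The equivalence between a bias and a weight on a constant input can be checked in a few lines. This is a minimal sketch in Python; the function names and the numbers are hypothetical.

```python
import math

def logistic(a):
    return 1.0 / (1.0 + math.exp(-a))

def unit_with_bias(inputs, weights, bias):
    return logistic(sum(w * x for w, x in zip(weights, inputs)) + bias)

def unit_with_constant_input(inputs, weights):
    # Same unit, but the bias is just the weight on an extra input fixed at 1.
    return logistic(sum(w * x for w, x in zip(weights, inputs + [1.0])))

x = [0.2, -1.5]
a = unit_with_bias(x, [0.7, 0.1], bias=-0.4)
b = unit_with_constant_input(x, [0.7, 0.1, -0.4])
print(a, b)
```

Since the two forms compute the same thing, a training algorithm needs no special case for biases.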
Consider a multilayer perceptron with any of the usual sigmoid activation
functions. Choose any hidden unit or output unit. Let's say there are N
inputs to that unit, which define an N-dimensional space. The given unit
draws a hyperplane through that space, producing an "on" output on one side
and an "off" output on the other. (With sigmoid units the plane will not be
sharp -- there will be some gray area of intermediate values near the
separating plane -- but ignore this for now.)
The weights determine where this hyperplane lies in the input space. Without
a bias input, this separating hyperplane is constrained to pass through the
origin of the space defined by the inputs. For some problems that's OK, but
in many problems the hyperplane would be much more useful somewhere else. If
you have many units in a layer, they share the same input space and without
bias would ALL be constrained to pass through the origin.
The "universal approximation" property of multilayer perceptrons with most
commonly-used hidden-layer activation functions does not hold if you omit
the bias units. But Hornik (1993) shows that a sufficient condition for the
universal approximation property without biases is that no derivative of the
activation function vanishes at the origin, which implies that with the
usual sigmoid activation functions, a fixed nonzero bias can be used.
Activation functions for the hidden units are needed to introduce
nonlinearity into the network. Without nonlinearity, hidden units would not
make nets more powerful than just plain perceptrons (which do not have any
hidden units, just input and output units). The reason is that a composition
of linear functions is again a linear function. However, it is the
nonlinearity (i.e., the capability to represent nonlinear functions) that
makes multilayer networks so powerful. Almost any nonlinear function does
the job, although for backpropagation learning it must be differentiable and
it helps if the function is bounded; the sigmoidal functions such as
logistic and tanh and the Gaussian function are the most common choices.
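The claim that linear hidden units add no power can be verified numerically. In this minimal Python sketch the weight matrices are arbitrary illustrative values and biases are left out for brevity.

```python
def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def matmul(a, b):
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

W1 = [[1.0, 2.0], [0.5, -1.0], [3.0, 0.0]]   # 2 inputs -> 3 linear hidden units
W2 = [[1.0, -2.0, 0.5]]                      # 3 hidden units -> 1 output

x = [0.3, -0.7]
two_layer = matvec(W2, matvec(W1, x))        # hidden layer without nonlinearity
one_layer = matvec(matmul(W2, W1), x)        # single equivalent weight matrix
print(two_layer, one_layer)
```

The two-layer net and the collapsed single-layer net agree for every input, so the hidden layer buys nothing until a nonlinear activation function is put on it.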
For the output units, you should choose an activation function suited to the
distribution of the target values. Bounded activation functions such as the
logistic are particularly useful when the target values have a bounded
range. But if the target values have no known bounded range, it is better to
use an unbounded activation function, most often the identity function
(which amounts to no activation function). If the target values are positive
but have no known upper bound, you can use an exponential output activation
function (but beware of overflow if you are writing your own code).
There are certain natural associations between output activation functions
and various noise distributions which have been studied by statisticians in
the context of generalized linear models. The output activation function is
the inverse of what statisticians call the "link function".
The purpose of the softmax activation function is to make the sum of the
outputs equal to one, so that the outputs are interpretable as posterior
probabilities. Let the net input to each output unit be q_i, i=1,...,c where
c is the number of categories. Then the softmax output p_i is:
$$p_i = \frac{\exp(q_i)}{\sum_{j=1}^{c} \exp(q_j)}$$
Unless you are using weight decay or Bayesian estimation or some such thing
that requires the weights to be treated on an equal basis, you can choose
any one of the output units and leave it completely unconnected--just set
the net input to 0. Connecting all of the output units will just give you
redundant weights and will slow down training. To see this, add an arbitrary
constant z to each net input and you get:
$$p_i = \frac{\exp(q_i + z)}{\sum_{j=1}^{c} \exp(q_j + z)} = \frac{\exp(q_i)\exp(z)}{\sum_{j=1}^{c} \exp(q_j)\exp(z)} = \frac{\exp(q_i)}{\sum_{j=1}^{c} \exp(q_j)}$$
so nothing changes. Hence you can always pick one of the output units, and
add an appropriate constant to each net input to produce any desired net
input for the selected output unit, which you can choose to be zero or
whatever is convenient. You can use the same trick to make sure that none of
the exponentials overflows.
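Here is a minimal sketch in Python of that overflow trick. Choosing z as minus the largest net input is one convenient choice of the arbitrary constant, and the input values are illustrative.

```python
import math

def softmax(q):
    """Softmax with the largest net input shifted to 0 so exp cannot overflow."""
    z = -max(q)                          # the arbitrary constant z from the text
    e = [math.exp(qi + z) for qi in q]
    total = sum(e)
    return [ei / total for ei in e]

p = softmax([1.0, 2.0, 3.0])
p_shifted = softmax([1001.0, 1002.0, 1003.0])   # a naive exp(1003) would overflow
print(p, sum(p))
print(p_shifted)
```

Both calls give the same probabilities, since the two sets of net inputs differ only by a constant.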
Statisticians usually call softmax a "multiple logistic" function. It
reduces to the simple logistic function when there are only two categories.
Suppose you choose to set q_2 to 0. Then
$$p_1 = \frac{\exp(q_1)}{\sum_{j=1}^{c} \exp(q_j)} = \frac{\exp(q_1)}{\exp(q_1) + \exp(0)} = \frac{1}{1 + \exp(-q_1)}$$
and p_2, of course, is 1-p_1.
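The two-category case can be confirmed numerically. This is a minimal Python sketch with an arbitrary illustrative value for q_1.

```python
import math

def softmax(q):
    e = [math.exp(qi) for qi in q]
    total = sum(e)
    return [ei / total for ei in e]

def logistic(a):
    return 1.0 / (1.0 + math.exp(-a))

q_1 = 0.8
p_1, p_2 = softmax([q_1, 0.0])       # second output left unconnected: q_2 = 0
print(p_1, logistic(q_1), p_2)
```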
The softmax function derives naturally from log-linear models and leads to
convenient interpretations of the weights in terms of odds ratios. You
could, however, use a variety of other nonnegative functions on the real
line in place of the exp function. Or you could constrain the net inputs to
the output units to be nonnegative, and just divide by the sum--that's
called the Bradley-Terry-Luce model.
A priori information can help with the curse of dimensionality. Careful
feature selection and scaling of the inputs fundamentally affects the
severity of the problem, as well as the selection of the neural network
model. For classification purposes, only the borders of the classes are
important to represent accurately.
The inputs to each hidden or output unit must be combined with the weights
to yield a single value called the "net input" to which the activation
function is applied. There does not seem to be a standard term for the
function that combines the inputs and weights; I will use the term
"combination function". Thus, each hidden or output unit in a feedforward
network first computes a combination function to produce the net input, and
then applies an activation function to the net input yielding the activation
of the unit.
A multilayer perceptron (MLP) has one or more hidden layers for which the
combination function is the inner product of the inputs and weights, plus a
bias.
The MLP architecture is the most popular one in practical applications. Each
layer uses a linear combination function. The inputs are fully connected to
the first hidden layer, each hidden layer is fully connected to the next,
and the last hidden layer is fully connected to the outputs. You can also
have "skip-layer" connections; direct connections from inputs to outputs are
especially useful.
Consider the multidimensional space of inputs to a given hidden unit. Since
an MLP uses linear combination functions, the set of all points in the space
having a given value of the activation function is a hyperplane. The
hyperplanes corresponding to different activation levels are parallel to
each other (the hyperplanes for different units are not parallel in
general). These parallel hyperplanes are the isoactivation contours of the
hidden unit.
Radial basis function (RBF) networks usually have only one hidden layer for
which the combination function is based on the Euclidean distance between
the input vector and the weight vector. RBF networks do not have anything
that's exactly the same as the bias term in an MLP. But some types of RBFs
have a "width" associated with each hidden unit or with the entire
hidden layer; instead of adding it in the combination function like a bias,
you divide the Euclidean distance by the width.
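As a sketch of how the width enters an RBF hidden unit, consider the following Python fragment. The Gaussian form exp(-(distance/width)^2) is one common convention and is an assumption here, since other scalings of the exponent are also in use. The center and widths are illustrative.

```python
import math

def rbf_unit(inputs, center, width):
    """Gaussian RBF hidden unit: exp of minus the squared scaled distance."""
    distance = math.sqrt(sum((x - c) ** 2 for x, c in zip(inputs, center)))
    net_input = distance / width         # the width divides; it is not added
    return math.exp(-net_input ** 2)

center = [1.0, 2.0]
print(rbf_unit([1.0, 2.0], center, width=0.5))   # at the center
print(rbf_unit([1.0, 3.0], center, width=0.5))   # one unit away, narrow bump
print(rbf_unit([1.0, 3.0], center, width=2.0))   # one unit away, wide bump
```

The activation is 1 at the center and falls off with distance, quickly for a small width and slowly for a large one.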
The ORBF architectures use radial combination functions and the exp
activation function. Only two of the radial combination functions are useful
with ORBF architectures. For radial combination functions including an
altitude, the altitude would be redundant with the hidden-to-output weights.
Radial combination functions are based on the Euclidean distance between the
vector of inputs to the unit and the vector of corresponding weights. Thus,
the isoactivation contours for ORBF networks are concentric hyperspheres. A
variety of activation functions can be used with the radial combination
function, but the exp activation function, yielding a Gaussian surface, is
the most useful. Radial networks typically have only one hidden layer, but
it can be useful to include a linear layer for dimensionality reduction or
oblique rotation before the RBF layer.
The output of an ORBF network consists of a number of superimposed bumps,
hence the output is quite bumpy unless many hidden units are used. Thus an
ORBF network with only a few hidden units is incapable of fitting a wide
variety of simple, smooth functions, and should rarely be used.
The NRBF architectures also use radial combination functions but the
activation function is softmax, which forces the sum of the activations for
the hidden layer to equal one. Thus, each output unit computes a weighted
average of the hidden-to-output weights, and the output values must lie
within the range of the hidden-to-output weights. Therefore, if the
hidden-to-output weights are within a reasonable range (such as the range of
the target values), you can be sure that the outputs will be within that
same range for all possible inputs, even when the net is extrapolating. No
comparably useful bound exists for the output of an ORBF network.
If you extrapolate far enough in a Gaussian ORBF network with an identity
output activation function, the activation of every hidden unit will
approach zero, hence the extrapolated output of the network will equal the
output bias. If you extrapolate far enough in an NRBF network, one hidden
unit will come to dominate the output. Hence if you want the network to
extrapolate different values in different directions, an NRBF should be
used instead of an ORBF.
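The difference in extrapolation is easy to see with one input and two hidden units. In this minimal Python sketch the centers, width, weights and bias are all hypothetical values, and the Gaussian is taken as exp(-(distance/width)^2).

```python
import math

centers = [0.0, 1.0]
weights = [2.0, 5.0]     # hidden-to-output weights
width = 1.0
bias = 0.5               # output bias, used by the ORBF net only

def net_inputs(x):
    return [-((x - c) / width) ** 2 for c in centers]

def orbf(x):
    """Ordinary RBF: bias plus a sum of Gaussian bumps."""
    return bias + sum(w * math.exp(q) for w, q in zip(weights, net_inputs(x)))

def nrbf(x):
    """Normalized RBF: softmax over the hidden units, then a weighted average."""
    q = net_inputs(x)
    top = max(q)                         # shift so the sum of exps cannot be zero
    e = [math.exp(qi - top) for qi in q]
    return sum(w * ei for w, ei in zip(weights, e)) / sum(e)

for x in (-20.0, 0.5, 20.0):
    print(x, orbf(x), nrbf(x))
```

Far from both centers the ORBF output falls back to the bias in either direction, while the NRBF output goes to 2 on the left and 5 on the right and never leaves the range of the hidden-to-output weights.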
Radial combination functions incorporating altitudes are useful with NRBF
architectures. The NRBF architectures combine some of the virtues of both
the RBF and MLP architectures, as explained below. However, the
isoactivation contours are considerably more complicated than for ORBF
architectures.
Consider the case of an NRBF network with only two hidden units. If the
hidden units have equal widths, the isoactivation contours are parallel
hyperplanes; in fact, this network is equivalent to an MLP with one logistic
hidden unit. If the hidden units have unequal widths, the isoactivation
contours are concentric hyperspheres; such a network is almost equivalent to
an ORBF network with one Gaussian hidden unit.
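The equal-width case can be checked numerically. When the two widths are equal, the squared terms in the inputs cancel in the difference of the two net inputs, leaving a linear function of the inputs. The Python sketch below builds the equivalent MLP from that difference. The centers, width and weights are hypothetical, and the Gaussian is taken as exp(-(distance/width)^2).

```python
import math

c1, c2 = [0.0, 1.0], [2.0, -1.0]   # hypothetical RBF centers
width = 1.5                         # shared width
v1, v2 = 4.0, -1.0                  # hidden-to-output weights

def nrbf(x):
    q = [-sum((xi - ci) ** 2 for xi, ci in zip(x, c)) / width ** 2
         for c in (c1, c2)]
    e = [math.exp(qi - max(q)) for qi in q]
    return (v1 * e[0] + v2 * e[1]) / sum(e)

# Equivalent MLP: one logistic hidden unit plus a linear output unit.
w = [2.0 * (a - b) / width ** 2 for a, b in zip(c1, c2)]
b = (sum(ci ** 2 for ci in c2) - sum(ci ** 2 for ci in c1)) / width ** 2

def mlp(x):
    hidden = 1.0 / (1.0 + math.exp(-(sum(wi * xi for wi, xi in zip(w, x)) + b)))
    return v2 + (v1 - v2) * hidden

for x in ([0.0, 0.0], [1.0, 0.5], [-2.0, 3.0]):
    print(nrbf(x), mlp(x))
```

The hidden weights of the MLP point along the line joining the two centers, which is why the isoactivation contours are hyperplanes perpendicular to that line.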
If there are more than two hidden units in an NRBF network, the
isoactivation contours have no such simple characterization. If the RBF
widths are very small, the isoactivation contours are approximately
piecewise linear for RBF units with equal widths, and approximately
piecewise spherical for RBF units with unequal widths. The larger the
widths, the smoother the isoactivation contours where the pieces join. As
Shorten and Murray-Smith (1996) point out, the activation is not necessarily
a monotone function of distance from the center when unequal widths are
used.
In an NRBFEQ architecture, if each observation is taken as an RBF center, and
if the weights are taken to be the target values, the outputs are simply
weighted averages of the target values, and the network is identical to the
well-known Nadaraya-Watson kernel regression estimator, which has been
reinvented at least twice in the neural net literature (see "What is
GRNN?"). A similar NRBFEQ network used for classification is equivalent to
kernel discriminant analysis (see "What is PNN?").
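Here is a minimal sketch in Python of the regression case, with a Gaussian kernel of the form exp(-(distance/width)^2). The training data and the common width are made-up values for illustration.

```python
import math

train_x = [0.0, 1.0, 2.0, 3.0]     # each observation is an RBF center
train_y = [1.0, 3.0, 2.0, 5.0]     # target values double as the output weights
width = 0.5                         # common kernel width (a free choice here)

def nadaraya_watson(x):
    """Kernel regression: Gaussian-weighted average of the training targets."""
    q = [-((x - c) / width) ** 2 for c in train_x]
    top = max(q)
    k = [math.exp(qi - top) for qi in q]   # normalized below, so the shift is harmless
    return sum(ki * yi for ki, yi in zip(k, train_y)) / sum(k)

print(nadaraya_watson(1.0))    # pulled toward the target at x = 1
print(nadaraya_watson(1.5))    # between the targets at x = 1 and x = 2
print(nadaraya_watson(10.0))   # far away: the nearest case dominates
```

No weights are trained at all in this sketch. The only thing left to choose is the width, which controls how smooth the fitted function is.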
Kernels with variable widths are also used for regression in the statistical
literature. Such kernel estimators correspond to the NRBFEV
architecture, in which the kernel functions have equal volumes but different
altitudes. In the neural net literature, variable-width kernels appear
always to be of the NRBFEH variety, with equal altitudes but unequal
volumes. The analogy with kernel regression would make the NRBFEV
architecture the obvious choice, but which of the two architectures works
better in practice is an open question.
Hybrid training is not often applied to MLPs because no effective methods
are known for unsupervised training of the hidden units (except when there
is only one input).
Hybrid training will usually require more hidden units than supervised
training. Since supervised training optimizes the locations of the centers,
while hybrid training does not, supervised training will provide a better
approximation to the function to be learned for a given number of hidden
units. Thus, the better fit provided by supervised training will often let
you use fewer hidden units for a given accuracy of approximation than you
would need with hybrid training. And if the hidden-to-output weights are
learned by linear least-squares, the fact that hybrid training requires more
hidden units implies that hybrid training will also require more training
cases for the same accuracy of generalization (Tarassenko and Roberts 1994).
The number of hidden units required by hybrid methods becomes an
increasingly serious problem as the number of inputs increases. In fact, the
required number of hidden units tends to increase exponentially with the
number of inputs. This drawback of hybrid methods is discussed by Minsky and
Papert (1969). For example, with method (1) for RBF networks, you would need
at least five elements in the grid along each dimension to detect a moderate
degree of nonlinearity; so if you have Nx inputs, you would need at least
5^Nx hidden units. For methods (2) and (3), the number of hidden units
increases exponentially with the effective dimensionality of the input
distribution. If the inputs are linearly related, the effective
dimensionality is the number of nonnegligible (a deliberately vague term)
eigenvalues of the covariance matrix, so the inputs must be highly
correlated if the effective dimensionality is to be much less than the
number of inputs.
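To get a feel for how fast the grid in method (1) grows, here is a tiny
sketch that simply tabulates 5^Nx. The function name and the default of
five grid points per dimension are illustrative, following the figure
quoted above.

```python
def grid_hidden_units(n_inputs, points_per_dim=5):
    """Number of RBF centers in a regular grid with points_per_dim per input."""
    return points_per_dim ** n_inputs

for n_inputs in (1, 2, 5, 10):
    print(n_inputs, grid_hidden_units(n_inputs))
```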
The exponential increase in the number of hidden units required for hybrid
learning is one aspect of the curse of dimensionality. The number of
training cases required also increases exponentially in general. No neural
network architecture--in fact no method of learning or statistical
estimation--can escape the curse of dimensionality in general, hence there
is no practical method of learning general functions in more than a few
dimensions.
An additive model is one in which the output is a sum of linear or nonlinear
transformations of the inputs. If an additive model is appropriate, the
number of weights increases linearly with the number of inputs, so high
dimensionality is not a curse. Various methods of training additive models
are available in the statistical literature (e.g. Hastie and Tibshirani
1990). You can also create a feedforward neural network, called a
"generalized additive network" (GAN), to fit additive models (Sarle 1994a).
Additive models have been proposed in the neural net literature under the
name "topologically distributed encoding" (Geiger 1990).
Projection pursuit regression (PPR) provides both universal approximation
and the ability to avoid the curse of dimensionality for certain common
types of target functions (Friedman and Stuetzle 1981). Like MLPs, PPR
computes the output as a sum of nonlinear transformations of linear
combinations of the inputs. Each term in the sum is analogous to a hidden
unit in an MLP. But unlike MLPs, PPR allows general, smooth nonlinear
transformations rather than a specific nonlinear activation function, and
allows a different transformation for each term. The nonlinear
transformations in PPR are usually estimated by nonparametric regression,
but you can set up a projection pursuit network (PPN), in which each
nonlinear transformation is performed by a subnetwork. If a PPN provides an
adequate fit with few terms, then the curse of dimensionality can be
avoided, and the results may even be interpretable.
If the target function can be accurately approximated by projection pursuit,
then it can also be accurately approximated by an MLP with a single hidden
layer. The disadvantage of the MLP is that there is little hope of
interpretability. An MLP with two or more hidden layers can provide a
parsimonious fit to a wider variety of target functions than can projection
pursuit, but no simple characterization of these functions is known.
With proper training, all of the RBF architectures listed above, as well as
MLPs, can process redundant inputs effectively. When there are redundant
inputs, the training cases lie close to some (possibly nonlinear) subspace.
If the same degree of redundancy applies to the test cases, the network need
produce accurate outputs only near the subspace occupied by the data. Adding
redundant inputs has little effect on the effective dimensionality of the
data; hence the curse of dimensionality does not apply, and even hybrid
methods (2) and (3) can be used. However, if the test cases do not follow
the same pattern of redundancy as the training cases, generalization will
require extrapolation and will rarely work well.
MLP architectures are good at ignoring irrelevant inputs. MLPs can also
select linear subspaces of reduced dimensionality. Since the first hidden
layer forms linear combinations of the inputs, it confines the network's
attention to the linear subspace spanned by the weight vectors. Hence,
adding irrelevant inputs to the training data does not increase the number
of hidden units required, although it increases the amount of training data
required.
ORBF architectures are not good at ignoring irrelevant inputs. The number of
hidden units required grows exponentially with the number of inputs,
regardless of how many inputs are relevant. This exponential growth is
related to the fact that ORBFs have local receptive fields, meaning that
changing the hidden-to-output weights of a given unit will affect the output
of the network only in a neighborhood of the center of the hidden unit,
where the size of the neighborhood is determined by the width of the hidden
unit. (Of course, if the width of the unit is learned, the receptive field
could grow to cover the entire training set.)
Local receptive fields are often an advantage compared to the distributed
architecture of MLPs, since local units can adapt to local patterns in the
data without having unwanted side effects in other regions. In a distributed
architecture such as an MLP, adapting the network to fit a local pattern in
the data can cause spurious side effects in other parts of the input space.
However, ORBF architectures often must be used with relatively small
neighborhoods, so that several hidden units are required to cover the range
of an input. When there are many nonredundant inputs, the hidden units must
cover the entire input space, and the number of units required is
essentially the same as in the hybrid case (1) where the centers are in a
regular grid; hence the exponential growth in the number of hidden units
with the number of inputs, regardless of whether the inputs are relevant.
You can enable an ORBF architecture to ignore irrelevant inputs by using an
extra, linear hidden layer before the radial hidden layer. This type of
network is sometimes called an "elliptical basis function" network. If the
number of units in the linear hidden layer equals the number of inputs, the
linear hidden layer performs an oblique rotation of the input space that can
suppress irrelevant directions and differentially weight relevant directions
according to their importance. If you think that the presence of irrelevant
inputs is highly likely, you can force a reduction of dimensionality by
using fewer units in the linear hidden layer than the number of inputs.
Note that the linear and radial hidden layers must be connected in series,
not in parallel, to ignore irrelevant inputs. In some applications it is
useful to have linear and radial hidden layers connected in parallel, but in
such cases the radial hidden layer will be sensitive to all inputs.
For even greater flexibility (at the cost of more weights to be learned),
you can have a separate linear hidden layer for each RBF unit, allowing a
different oblique rotation for each RBF unit.
NRBF architectures with equal widths (NRBFEW and NRBFEQ) combine the
advantage of local receptive fields with the ability to ignore irrelevant
inputs. The receptive field of one hidden unit extends from the center in
all directions until it encounters the receptive field of another hidden
unit. It is convenient to think of a "boundary" between the two receptive
fields, defined as the hyperplane where the two units have equal
activations, even though the effect of each unit will extend somewhat beyond
the boundary. The location of the boundary depends on the heights of the
hidden units. If the two units have equal heights, the boundary lies midway
between the two centers. If the units have unequal heights, the boundary is
farther from the higher unit.
If a hidden unit is surrounded by other hidden units, its receptive field is
indeed local, curtailed by the field boundaries with other units. But if a
hidden unit is not completely surrounded, its receptive field can extend
infinitely in certain directions. If there are irrelevant inputs, or more
generally, irrelevant directions that are linear combinations of the inputs,
the centers need only be distributed in a subspace orthogonal to the
irrelevant directions. In this case, the hidden units can have local
receptive fields in relevant directions but infinite receptive fields in
irrelevant directions.
For NRBF architectures allowing unequal widths (NRBFUN, NRBFEV, and NRBFEH),
the boundaries between receptive fields are generally hyperspheres rather
than hyperplanes. In order to ignore irrelevant inputs, such networks must
be trained to have equal widths. Hence, if you think there is a strong
possibility that some of the inputs are irrelevant, it is usually better to
use an architecture with equal widths.
OLS is a variety of supervised training. But whereas backprop and other
commonly-used supervised methods are forms of continuous optimization, OLS
is a form of combinatorial optimization. Rather than treating the RBF
centers as continuous values to be adjusted to reduce the training error,
OLS starts with a large set of candidate centers and selects a subset that
usually provides good training error. For small training sets, the
candidates can include all of the training cases. For large training sets,
it is more efficient to use a random subset of the training cases or to do a
cluster analysis and use the cluster means as candidates.
"Normalizing" a vector most often means dividing by a norm of the vector,
for example, to make the Euclidean length of the vector equal to one. In the
NN literature, "normalizing" also often refers to rescaling by the minimum
and range of the vector, to make all the elements lie between 0 and 1.
"Standardizing" a vector most often means subtracting a measure of location
and dividing by a measure of scale. For example, if the vector contains
random values with a Gaussian distribution, you might subtract the mean and
divide by the standard deviation, thereby obtaining a "standard normal"
random variable with mean 0 and standard deviation 1.
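The three operations just described can be sketched as follows. This is a
minimal illustration using only the Python standard library; the function
names are hypothetical, and the sample standard deviation (n-1
denominator) is an arbitrary choice of scale measure.

```python
import math
import statistics

def normalize_unit_length(v):
    """Divide by the Euclidean norm so the vector has length one."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def rescale_min_range(v):
    """Subtract the minimum and divide by the range so elements lie in [0, 1]."""
    lo, hi = min(v), max(v)
    return [(x - lo) / (hi - lo) for x in v]

def standardize(v):
    """Subtract the mean and divide by the (sample) standard deviation."""
    mean = statistics.mean(v)
    sd = statistics.stdev(v)
    return [(x - mean) / sd for x in v]

values = [2.0, 4.0, 4.0, 10.0]   # made-up example data
print(normalize_unit_length(values))
print(rescale_min_range(values))
print(standardize(values))
```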
There is a common misconception that the inputs to a multilayer perceptron
must be in the interval [0,1]. There is in fact no such requirement,
although there often are benefits to standardizing the inputs as discussed
below. But it is better to have the input values centered around zero, so
scaling the inputs to the interval [0,1] is usually a bad choice.
If your output activation function has a range of [0,1], then obviously you
must ensure that the target values lie within that range. But it is
generally better to choose an output activation function suited to the
distribution of the targets than to force your data to conform to the output
activation function. See "Why use activation functions?"
When using an output activation with a range of [0,1], some people prefer to
rescale the targets to a range of [.1,.9]. I suspect that the popularity of
this gimmick is due to the slowness of standard backprop. But using a target
range of [.1,.9] for a classification task gives you incorrect posterior
probability estimates, and it is quite unnecessary if you use an efficient
training algorithm (see "What are conjugate gradients, Levenberg-Marquardt,
etc.?").
Now for some of the gory details: note that the training data form a matrix.
Let's set up this matrix so that each case forms a row, and the inputs and
target variables form columns. You could conceivably standardize the rows or
the columns or both or various other things, and these different ways of
choosing vectors to standardize will have quite different effects on
training.
Standardizing either input or target variables tends to make the training
process better behaved by improving the numerical condition of the
optimization problem and ensuring that various default values involved in
initialization and termination are appropriate. Standardizing targets can
also affect the objective function.
Standardization of cases should be approached with caution because it
discards information. If that information is irrelevant, then standardizing
cases can be quite helpful. If that information is important, then
standardizing cases can be disastrous.
If the input variables are combined linearly, as in an MLP, then it is
rarely strictly necessary to standardize the inputs, at least in theory. The
reason is that any rescaling of an input vector can be effectively undone by
changing the corresponding weights and biases, leaving you with the exact
same outputs as you had before. However, there are a variety of practical
reasons why standardizing the inputs can make training faster and reduce the
chances of getting stuck in local optima. Also, weight decay and Bayesian
estimation can be done more conveniently with standardized inputs.
The main emphasis in the NN literature on initial values has been on the
avoidance of saturation, hence the desire to use small random values. How
small these random values should be depends on the scale of the inputs as
well as the number of inputs and their correlations. Standardizing inputs
removes the problem of scale dependence of the initial weights.
But standardizing input variables can have far more important effects on
initialization of the weights than simply avoiding saturation. Assume we
have an MLP with one hidden layer applied to a classification problem and
are therefore interested in the hyperplanes defined by each hidden unit.
Each hyperplane is the locus of points where the net-input to the hidden
unit is zero and is thus the classification boundary generated by that
hidden unit considered in isolation. The connection weights from the inputs
to a hidden unit determine the orientation of the hyperplane. The bias
determines the distance of the hyperplane from the origin. If the bias terms
are all small random numbers, then all the hyperplanes will pass close to
the origin. Hence, if the data are not centered at the origin, the
hyperplane may fail to pass through the data cloud. If all the inputs have a
small coefficient of variation, it is quite possible that all the initial
hyperplanes will miss the data entirely. With such a poor initialization,
local minima are very likely to occur. It is therefore important to center
the inputs to get good random initializations. In particular, scaling the
inputs to [-1,1] will work better than [0,1], although any scaling that sets
to zero the mean or median or other measure of central tendency is likely to
be as good or better.
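The effect of centering on random initial hyperplanes can be seen in a
small simulation. The sketch below is an assumption-laden toy: the data
are a made-up 5-by-5 grid near (100, 100), the weights and biases are
drawn uniformly from [-0.5, 0.5], and a hyperplane is counted as passing
through the data cloud when the net input changes sign across the cases.

```python
import random

def splits_data(weights, bias, data):
    """True if the hyperplane w.x + bias = 0 has cases on both sides."""
    nets = [sum(w * x for w, x in zip(weights, row)) + bias for row in data]
    return min(nets) < 0.0 < max(nets)

def count_hits(data, n_planes=200, scale=0.5, seed=0):
    """Count random hyperplanes (small weights and biases) that cut the data."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_planes):
        weights = [rng.uniform(-scale, scale) for _ in data[0]]
        bias = rng.uniform(-scale, scale)
        hits += splits_data(weights, bias, data)
    return hits

# Inputs with a small coefficient of variation: mean 100, spread of +/-2.
raw = [(100.0 + i, 100.0 + j) for i in range(-2, 3) for j in range(-2, 3)]
means = [sum(col) / len(raw) for col in zip(*raw)]
centered = [tuple(x - m for x, m in zip(row, means)) for row in raw]

raw_hits = count_hits(raw)            # same random hyperplanes in both runs
centered_hits = count_hits(centered)
print(raw_hits, centered_hits)
```

With the uncentered data, a hyperplane cuts the cloud only when the two
weights nearly cancel, so most initial hyperplanes miss the data; after
centering, most of them pass through it.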
Standardizing target variables is typically more a convenience for getting
good initial weights than a necessity. However, if you have two or more
target variables and your error function is scale-sensitive like the usual
least (mean) squares error function, then the variability of each target
relative to the others can affect how well the net learns that target. If
one target has a range of 0 to 1, while another target has a range of 0 to
1,000,000, the net will expend most of its effort learning the second target
to the possible exclusion of the first. So it is essential to rescale the
targets so that their variability reflects their importance, or at least is
not in inverse relation to their importance. If the targets are of equal
importance, they should typically be standardized to the same range or the
same standard deviation.
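A quick calculation shows how lopsided the squared error becomes. In the
sketch below, the target values and the hypothetical net that is 10% off
on every value are invented for illustration; dividing each target by its
standard deviation is one of several reasonable ways to equalize the
scales.

```python
import statistics

def sse(targets, outputs):
    """Sum of squared errors for one target variable."""
    return sum((t - o) ** 2 for t, o in zip(targets, outputs))

def share_of_first(t1, o1, t2, o2):
    """Fraction of the total squared error contributed by the first target."""
    e1, e2 = sse(t1, o1), sse(t2, o2)
    return e1 / (e1 + e2)

small = [0.1, 0.4, 0.6, 0.9]                       # range roughly 0 to 1
large = [100000.0, 400000.0, 600000.0, 900000.0]   # range roughly 0 to 1,000,000
small_out = [0.9 * t for t in small]               # hypothetical outputs, 10% low
large_out = [0.9 * t for t in large]

share_raw = share_of_first(small, small_out, large, large_out)

# Rescale each target (and its outputs) by that target's standard deviation.
sd_small, sd_large = statistics.stdev(small), statistics.stdev(large)
share_scaled = share_of_first([t / sd_small for t in small],
                              [o / sd_small for o in small_out],
                              [t / sd_large for t in large],
                              [o / sd_large for o in large_out])
print(share_raw, share_scaled)
```

Before rescaling, the first target contributes a negligible fraction of
the error even though both targets are fitted equally badly in relative
terms; after rescaling, the two contribute equally.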
The scaling of the targets does not affect their importance in training if
you use maximum likelihood estimation and estimate a separate scale
parameter (such as a standard deviation) for each target variable. In this
case, the importance of each target is inversely related to its estimated
scale parameter. In other words, noisier targets will be given less
importance.
For weight decay and Bayesian estimation, the scaling of the targets affects
the decay values and prior distributions. Hence it is usually most
convenient to work with standardized targets.
As noted above, standardization of cases discards information, so issues
regarding the standardization of cases must be carefully evaluated in
every application. There are no rules of thumb that apply to all
applications.
You may want to standardize each case if there is extraneous variability
between cases. Consider the common situation in which each input variable
represents a pixel in an image. If the images vary in exposure, and exposure
is irrelevant to the target values, then it would usually help to subtract
the mean of each case to equate the exposures of different cases. If the
images vary in contrast, and contrast is irrelevant to the target values,
then it would usually help to divide each case by its standard deviation to
equate the contrasts of different cases. Given sufficient data, a NN could
learn to ignore exposure and contrast. However, training will be easier and
generalization better if you can remove the extraneous exposure and contrast
information before training the network.
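As a sketch of the image example, treat exposure as an additive offset
and contrast as a multiplicative factor applied to every pixel of a case.
These are simplifying assumptions, and the pixel values and function name
below are made up for illustration.

```python
import statistics

def standardize_case(pixels):
    """Subtract the case mean and divide by the case standard deviation.

    Assumes the case is not constant, so the standard deviation is nonzero.
    """
    mean = statistics.mean(pixels)
    sd = statistics.pstdev(pixels)
    return [(p - mean) / sd for p in pixels]

image = [10.0, 20.0, 30.0, 80.0]                 # one case, one value per pixel
same_scene = [2.0 * p + 15.0 for p in image]     # higher contrast, more exposure

a = standardize_case(image)
b = standardize_case(same_scene)
print(a)
print(b)
```

Both versions of the image map to the same standardized case, so the
network never sees the exposure or contrast differences.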
As another example, suppose you want to classify plant specimens according
to species but the specimens are at different stages of growth. You have
measurements such as stem length, leaf length, and leaf width. However, the
over-all size of the specimen is determined by age or growing conditions,
not by species. Given sufficient data, a NN could learn to ignore the size
of the specimens and classify them by shape instead. However, training will
be easier and generalization better if you can remove the extraneous size
information before training the network. Size in the plant example
corresponds to exposure in the image example.
If the data are measured on a ratio scale, you can control for size by
dividing each datum by a measure of over-all size. It is common to divide by
the sum or by the arithmetic mean. For positive ratio data, however, the
geometric mean is often a more natural measure of size than the arithmetic
mean. It may also be more meaningful to analyze the logarithms of positive
ratio-scaled data, in which case you can subtract the arithmetic mean after
taking logarithms. You must also consider the dimensions of measurement. For
example, if you have measures of both length and weight, you may need to
cube the measures of length or take the cube root of the weights.
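For positive ratio-scaled data, dividing by the geometric mean is the
same as subtracting the arithmetic mean of the logarithms. The sketch
below applies this to the plant example; the measurements and the
function name are hypothetical, and all measures are assumed to be
lengths so that no dimensional adjustment is needed.

```python
import math

def remove_size(measurements):
    """Log-transform positive measurements and subtract the mean log.

    Equivalent to taking logs after dividing by the geometric mean.
    """
    logs = [math.log(m) for m in measurements]
    mean_log = sum(logs) / len(logs)
    return [value - mean_log for value in logs]

# Stem length, leaf length, leaf width (made-up values, same shape, different size).
small_plant = [10.0, 4.0, 2.0]
large_plant = [30.0, 12.0, 6.0]

shape_small = remove_size(small_plant)
shape_large = remove_size(large_plant)
print(shape_small)
print(shape_large)
```

The two specimens differ only in over-all size, so they end up with the
same shape vector.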
Most importantly, nonlinear transformations of the targets matter with
noisy data, via their effect on the error function. Many commonly used
error functions are functions solely of the difference abs(target-output).
Nonlinear transformations (unlike linear transformations) change the
relative sizes of these differences. With most error functions, the net will
expend more effort, so to speak, trying to learn target values for which
abs(target-output) is large.
For example, suppose you are trying to predict the price of a stock. If the
price of the stock is 10 (in whatever currency unit) and the output of the
net is 5 or 15, yielding a difference of 5, that is a huge error. If the
price of the stock is 1000 and the output of the net is 995 or 1005,
yielding the same difference of 5, that is a tiny error. You don't want the
net to treat those two differences as equally important. By taking
logarithms, you are effectively measuring errors in terms of ratios rather
than differences, since a difference between two logs corresponds to the
ratio of the original values. This has approximately the same effect as
looking at percentage differences, abs(target-output)/target or
abs(target-output)/output, rather than simple differences.
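The stock-price example can be worked through directly. This is only a
small arithmetic sketch using the numbers from the text; the helper names
are illustrative.

```python
import math

def raw_error(target, output):
    """Simple difference abs(target-output)."""
    return abs(target - output)

def log_error(target, output):
    """Difference of logs, i.e. the log of the ratio of the two values."""
    return abs(math.log(target) - math.log(output))

cases = [(10.0, 5.0), (1000.0, 995.0)]   # (target, output) pairs from the text
for target, output in cases:
    print(target, output,
          raw_error(target, output),
          log_error(target, output),
          raw_error(target, output) / target)
```

The raw differences are identical, but on the log scale the error for the
cheap stock is more than a hundred times larger than the error for the
expensive one, much as the percentage differences would suggest.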
It is usually advisable to choose an error function appropriate for the
distribution of noise in your target variables (McCullagh and Nelder 1989).
But if your software does not provide a sufficient variety of error
functions, then you may need to transform the target so that the noise
distribution conforms to whatever error function you are using. For example,
if you have to use least-(mean-)squares training, you will get the best
results if the noise distribution is approximately Gaussian with constant
variance, since least-(mean-)squares is maximum likelihood in that case.
Heavy-tailed distributions (those in which extreme values occur more often
than in a Gaussian distribution, often as indicated by high kurtosis) are
especially of concern, due to the loss of statistical efficiency of
least-(mean-)square estimates (Huber 1981). Note that what is important is
the distribution of the noise, not the distribution of the target values.
ART stands for "Adaptive Resonance Theory", invented by Stephen Grossberg in
1976. ART encompasses a wide variety of neural networks based explicitly on
neurophysiology. ART networks are defined algorithmically in terms of
detailed differential equations intended as plausible models of biological
neurons. In practice, ART networks are implemented using analytical
solutions or approximations to these differential equations.
PNN or "Probabilistic Neural Network" is Donald Specht's term for kernel
discriminant analysis. You can think of it as a normalized RBF network in
which there is a hidden unit centered at every training case. These RBF
units are called "kernels" and are usually probability density functions
such as the Gaussian. The hidden-to-output weights are usually 1 or 0; for
each hidden unit, a weight of 1 is used for the connection going to the
output that the case belongs to, while all other connections are given
weights of 0. Alternatively, you can adjust these weights for the prior
probabilities of each class. So the only weights that need to be learned are
the widths of the RBF units. These widths (often a single width is used) are
called "smoothing parameters" or "bandwidths" and are usually chosen by
cross-validation or by more esoteric methods that are not well-known in the
neural net literature; gradient descent is not used.
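The description above translates into very little code. The following is
a minimal sketch, assuming a Gaussian kernel with a single width and
hidden-to-output weights of 1 or 0 (so the implied priors are the class
frequencies in the training set). The function name and the toy data are
invented for illustration, and no attempt is made to guard against
underflow when the test case is far from every training case.

```python
import math

def pnn_posteriors(x, train_inputs, train_classes, width):
    """Kernel discriminant analysis: posterior estimates for one test case."""
    scores = {}
    for center, label in zip(train_inputs, train_classes):
        d2 = sum((xi - ci) ** 2 for xi, ci in zip(x, center))
        # Each training case contributes one kernel to its own class only.
        scores[label] = scores.get(label, 0.0) + math.exp(-d2 / (2 * width ** 2))
    total = sum(scores.values())
    return {label: s / total for label, s in scores.items()}

# Made-up two-class training set.
train_inputs = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
                (5.0, 5.0), (5.0, 6.0), (6.0, 5.0)]
train_classes = ["a", "a", "a", "b", "b", "b"]

posteriors = pnn_posteriors((0.5, 0.5), train_inputs, train_classes, width=1.0)
print(posteriors)
```

Note that every prediction loops over the entire training set, which is
why computing an output is slow compared with a feedforward net.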
Specht's claim that a PNN trains 100,000 times faster than backprop is at
best misleading. While they are not iterative in the same sense as backprop,
kernel methods require that you estimate the kernel bandwidth, and this
requires accessing the data many times. Furthermore, computing a single
output value with kernel methods requires either accessing the entire
training data or clever programming, and either way is much slower than
computing an output with a feedforward net. And there are a variety of
methods for training feedforward nets that are much faster than standard
backprop. So depending on what you are doing and how you do it, PNN may be
either faster or slower than a feedforward net.
PNN is a universal approximator for smooth class-conditional densities, so
it should be able to solve any smooth classification problem given enough
data. The main drawback of PNN is that, like kernel methods in general, it
suffers badly from the curse of dimensionality. PNN cannot ignore irrelevant
inputs without major modifications to the basic algorithm. So PNN is not
likely to be the top choice if you have more than 5 or 6 nonredundant
inputs.
But if all your inputs are relevant, PNN has the very useful ability to tell
you whether a test case is similar (i.e. has a high density) to any of the
training data; if not, you are extrapolating and should view the output
classification with skepticism. This ability is of limited use when you have
irrelevant inputs, since the similarity is measured with respect to all of
the inputs, not just the relevant ones.
GRNN or "General Regression Neural Network" is Donald Specht's term for
Nadaraya-Watson kernel regression, also reinvented in the NN literature by
Schiøler and Hartmann. You can think of it as a normalized RBF network in
which there is a hidden unit centered at every training case. These RBF
units are called "kernels" and are usually probability density functions
such as the Gaussian. The hidden-to-output weights are just the target
values, so the output is simply a weighted average of the target values of
training cases close to the given input case. The only weights that need to
be learned are the widths of the RBF units. These widths (often a single
width is used) are called "smoothing parameters" or "bandwidths" and are
usually chosen by cross-validation or by more esoteric methods that are not
well-known in the neural net literature; gradient descent is not used.
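Here is a minimal sketch of the estimator and of choosing the width by
leave-one-out cross-validation. It assumes a Gaussian kernel with a
single width; the function names, the toy data, and the list of candidate
widths are all made up for illustration.

```python
import math

def grnn_predict(x, train_inputs, train_targets, width):
    """Nadaraya-Watson estimate: kernel-weighted average of the targets."""
    kernels = [math.exp(-sum((xi - ci) ** 2 for xi, ci in zip(x, c))
                        / (2 * width ** 2)) for c in train_inputs]
    return sum(k * t for k, t in zip(kernels, train_targets)) / sum(kernels)

def loo_squared_error(train_inputs, train_targets, width):
    """Leave-one-out sum of squared prediction errors for a given width."""
    total = 0.0
    for i in range(len(train_inputs)):
        rest_x = train_inputs[:i] + train_inputs[i + 1:]
        rest_t = train_targets[:i] + train_targets[i + 1:]
        total += (grnn_predict(train_inputs[i], rest_x, rest_t, width)
                  - train_targets[i]) ** 2
    return total

# Made-up one-input training set.
xs = [(0.0,), (1.0,), (2.0,), (3.0,), (4.0,)]
ts = [0.0, 0.8, 0.9, 0.1, -0.8]

candidates = [0.5, 1.0, 2.0]   # hypothetical candidate widths
best_width = min(candidates, key=lambda w: loo_squared_error(xs, ts, w))
prediction = grnn_predict((1.5,), xs, ts, best_width)
print(best_width, prediction)
```

Because the output is a weighted average with positive weights, it can
never fall outside the range of the training targets.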
GRNN is a universal approximator for smooth functions, so it should be able
to solve any smooth function-approximation problem given enough data. The
main drawback of GRNN is that, like kernel methods in general, it suffers
badly from the curse of dimensionality. GRNN cannot ignore irrelevant inputs
without major modifications to the basic algorithm. So GRNN is not likely to
be the top choice if you have more than 5 or 6 nonredundant inputs.
Unsupervised learning allegedly involves no target values. In fact, for most
varieties of unsupervised learning, the targets are the same as the inputs
(Sarle 1994). In other words, unsupervised learning usually performs the
same task as an auto-associative network, compressing the information from
the inputs (Deco and Obradovic 1996). Unsupervised learning is very useful
for data visualization (Ripley 1996), although the NN literature generally
ignores this application.
Unsupervised competitive learning is used in a wide variety of fields under
a wide variety of names, the most common of which is "cluster analysis" (see
the Classification Society of North America's web site for more information
on cluster analysis, including software, at http://www.pitt.edu/~csna/.) The
main form of competitive learning in the NN literature is vector
quantization (VQ, also called a "Kohonen network", although Kohonen invented
several other types of networks as well--see "How many kinds of Kohonen
networks exist?" which provides more references on VQ). Kosko (1992) and
Hecht-Nielsen (1990) review neural approaches to VQ, while the textbook by
Gersho and Gray (1992) covers the area from the perspective of signal
processing. In statistics, VQ has been called "principal point analysis"
(Flury, 1990, 1993; Tarpey et al., 1994) but is more frequently encountered
in the guise of k-means clustering. In VQ, each of the competitive units
corresponds to a cluster center (also called a codebook vector), and the
error function is the sum of squared Euclidean distances between each
training case and the nearest center. Often, each training case is
normalized to a Euclidean length of one, which allows distances to be
simplified to inner products. The more general error function based on
distances is the same error function used in k-means clustering, one of the
most common types of cluster analysis (MacQueen 1967; Anderberg 1973). The
k-means model is an approximation to the normal mixture model (McLachlan and
Basford 1988) assuming that the mixture components (clusters) all have
spherical covariance matrices and equal sampling probabilities. Normal
mixtures have found a variety of uses in neural networks (e.g., Bishop
1995). Balakrishnan, Cooper, Jacob, and Lewis (1994) found that k-means
algorithms used as normal-mixture approximations recover cluster membership
more accurately than Kohonen algorithms.
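The VQ error function is easy to state in code. The sketch below computes
the sum of squared Euclidean distances to the nearest center and performs
one batch k-means update (each center moves to the mean of the cases
assigned to it). The batch update is used here only for illustration; it
is not Kohonen's on-line algorithm, and the data and function names are
invented.

```python
def squared_distance(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def nearest(x, centers):
    """Index of the center (codebook vector) closest to case x."""
    return min(range(len(centers)), key=lambda j: squared_distance(x, centers[j]))

def vq_error(cases, centers):
    """Sum of squared distances between each case and its nearest center."""
    return sum(squared_distance(x, centers[nearest(x, centers)]) for x in cases)

def kmeans_step(cases, centers):
    """One batch update: move each center to the mean of its assigned cases."""
    groups = [[] for _ in centers]
    for x in cases:
        groups[nearest(x, centers)].append(x)
    new_centers = []
    for old, group in zip(centers, groups):
        if group:
            new_centers.append(tuple(sum(col) / len(group) for col in zip(*group)))
        else:
            new_centers.append(old)   # keep a center that won no cases
    return new_centers

cases = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
centers = [(0.0, 0.0), (10.0, 2.0)]   # made-up starting centers
before = vq_error(cases, centers)
centers = kmeans_step(cases, centers)
after = vq_error(cases, centers)
print(before, after, centers)
```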
Hebbian learning is the other most common variety of unsupervised
learning (Hertz, Krogh, and Palmer 1991). Hebbian learning minimizes the
same error function as an auto-associative network with a linear hidden
layer, trained by least squares, and is therefore a form of dimensionality
reduction. This error function is equivalent to the sum of squared distances
between each training case and a linear subspace of the input space (with
distances measured perpendicularly), and is minimized by the leading
principal components (Pearson 1901; Hotelling 1933; Rao 1964; Jolliffe 1986;
Jackson 1991; Diamantaras and Kung 1996). There are variations of Hebbian
learning that explicitly produce the principal components (Hertz, Krogh, and
Palmer 1991; Karhunen 1994; Deco and Obradovic 1996; Diamantaras and Kung
1996).
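One well-known variation of this kind is Oja's rule, which adds a decay
term to the plain Hebbian update so that the weight vector of a single
linear unit converges to the first principal component. The sketch below
is an illustration only: Oja's rule is chosen here as an example and is
not singled out by the text above, and the toy data, learning rate, and
number of epochs are arbitrary assumptions.

```python
import math

def oja_first_component(data, rate=0.01, epochs=500, start=(1.0, 0.0)):
    """Estimate the leading principal direction of zero-mean data with Oja's rule."""
    w = list(start)
    for _ in range(epochs):
        for x in data:
            y = sum(wi * xi for wi, xi in zip(w, x))   # output of the linear unit
            # Hebbian term rate*y*x plus a decay term -rate*y*y*w.
            w = [wi + rate * y * (xi - y * wi) for wi, xi in zip(w, x)]
    return w

# Zero-mean toy data stretched along the direction u = (2, 1)/sqrt(5).
u = (2 / math.sqrt(5), 1 / math.sqrt(5))
v = (-1 / math.sqrt(5), 2 / math.sqrt(5))
data = [(t * u[0] + s * v[0], t * u[1] + s * v[1])
        for t in (-2.0, -1.0, 1.0, 2.0) for s in (-0.2, 0.2)]

w = oja_first_component(data)
norm = math.sqrt(sum(wi * wi for wi in w))
cosine = sum(wi * ui for wi, ui in zip(w, u)) / norm
print(w, norm, cosine)
```

The learned weight vector ends up with length close to one and nearly
parallel to u, the direction of greatest variance in the toy data.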
During learning, the outputs of a supervised neural net come to approximate
the target values given the inputs in the training set. This ability may be
useful in itself, but more often the purpose of using a neural net is to
generalize--i.e., to have the outputs of the net approximate target values
given inputs that are not in the training set. Generalization is not always
possible, despite the blithe assertions of some authors. For example,
Caudill and Butler, 1990, p. 8, claim that "A neural network is able to
generalize", but they provide no justification for this claim, and they
completely neglect the complex issues involved in getting good
generalization.
There are three conditions that are typically necessary (although not
sufficient) for good generalization.
The first necessary condition is that the inputs to the network contain
sufficient information pertaining to the target, so that there exists a
mathematical function relating correct outputs to inputs with the desired
degree of accuracy. You can't expect a network to learn a nonexistent
function--neural nets are not clairvoyant! For example, if you want to
forecast the price of a stock, a historical record of the stock's prices is
rarely sufficient input; you need detailed information on the financial
state of the company as well as general economic conditions, and to avoid
nasty surprises, you should also include inputs that can accurately predict
wars in the Middle East and earthquakes in Japan. Finding good inputs for a
net and collecting enough training data often take far more time and effort
than training the network.
The second necessary condition is that the function you are trying to learn
(that relates inputs to correct outputs) be, in some sense, smooth. In other
words, a small change in the inputs should, most of the time, produce a
small change in the outputs. For continuous inputs and targets, smoothness
of the function implies continuity and restrictions on the first derivative
over most of the input space. Some neural nets can learn discontinuities as
long as the function consists of a finite number of continuous pieces. Very
nonsmooth functions such as those produced by pseudo-random number
generators and encryption algorithms cannot be generalized by neural nets.
Often a nonlinear transformation of the input space can increase the
smoothness of the function and improve generalization.
For classification, if you do not need to estimate posterior probabilities,
then smoothness is not theoretically necessary. In particular, feedforward
networks with one hidden layer trained by minimizing the error rate (a very
tedious training method) are universally consistent classifiers if the
number of hidden units grows at a suitable rate relative to the number of
training cases (Devroye, Györfi, and Lugosi, 1996). However, you are
likely to get better generalization with realistic sample sizes if the
classification boundaries are smoother.
For Boolean functions, the concept of smoothness is more elusive. It seems
intuitively clear that a Boolean network with a small number of hidden units
and small weights will compute a "smoother" input-output function than a
network with many hidden units and large weights. If you know a good
reference characterizing Boolean functions for which good generalization is
possible, please inform the FAQ maintainer (sas...@unx.sas.com).
The third necessary condition for good generalization is that the training
cases be a sufficiently large and representative subset ("sample" in
statistical terminology) of the set of all cases that you want to generalize
to (the "population" in statistical terminology). The importance of this
condition is related to the fact that there are, loosely speaking, two
different types of generalization: interpolation and extrapolation.
Interpolation applies to cases that are more or less surrounded by nearby
training cases; everything else is extrapolation. In particular, cases that
are outside the range of the training data require extrapolation. Cases
inside large "holes" in the training data may also effectively require
extrapolation. Interpolation can often be done reliably, but extrapolation
is notoriously unreliable. Hence it is important to have sufficient training
data to avoid the need for extrapolation. Methods for selecting good
training sets are discussed in numerous statistical textbooks on sample
surveys and experimental design.
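The gap between interpolation and extrapolation can be seen with a small
experiment. The sketch below is not a neural net: it assumes a polynomial
fitted exactly through eight training cases as a stand-in for any flexible
nonlinear model, and the target function sin(x), the training range [0, 3],
and the two test points are arbitrary choices made for illustration.

```python
import math

def lagrange_predict(x, xs, ys):
    """Evaluate the polynomial that passes through every (xs[i], ys[i])."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        weight = 1.0
        for j, xj in enumerate(xs):
            if j != i:
                weight *= (x - xj) / (xi - xj)
        total += yi * weight
    return total

# Eight noise-free training cases of sin(x), evenly spaced on [0, 3].
train_x = [3 * i / 7 for i in range(8)]
train_y = [math.sin(x) for x in train_x]

def abs_error(x):
    """Absolute prediction error of the fitted polynomial at x."""
    return abs(lagrange_predict(x, train_x, train_y) - math.sin(x))

interpolation_error = abs_error(1.7)  # surrounded by training cases
extrapolation_error = abs_error(6.0)  # far outside the training range

print(f"error at x=1.7 (interpolation): {interpolation_error:.2e}")
print(f"error at x=6.0 (extrapolation): {extrapolation_error:.2e}")
```

The model fits the training cases perfectly in both situations; only the
location of the test case differs. The error at the outside point should
come out several orders of magnitude larger than the error at the inside
point.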
Thus, for an input-output function that is smooth, if you have a test case
that is close to some training cases, the correct output for the test case
will be close to the correct outputs for those training cases. If you have
an adequate sample for your training set, every case in the population will
be close to a sufficient number of training cases. Hence, under these
conditions and with proper training, a neural net will be able to generalize
reliably to the population.
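One simple way to check whether a case is "close to a sufficient number of
training cases" is to look at its distance to the k-th nearest training
case. The sketch below assumes Euclidean distance on two inputs that are
already on comparable scales; the function names, k=3, and the radius of
0.15 are hypothetical choices for illustration, not recommended values.

```python
import math

def kth_nearest_distance(query, training_inputs, k=3):
    """Distance from query to its k-th nearest training case."""
    distances = sorted(math.dist(query, x) for x in training_inputs)
    return distances[k - 1]

def needs_extrapolation(query, training_inputs, k=3, radius=0.15):
    """True if fewer than k training cases lie within radius of query."""
    return kth_nearest_distance(query, training_inputs, k) > radius

# Hypothetical training inputs: a grid on the unit square with a "hole"
# (the central 3-by-3 block of grid points is missing).
training_inputs = [
    (i / 10, j / 10)
    for i in range(11)
    for j in range(11)
    if not (4 <= i <= 6 and 4 <= j <= 6)
]

surrounded = needs_extrapolation((0.15, 0.85), training_inputs)
in_hole = needs_extrapolation((0.5, 0.5), training_inputs)
out_of_range = needs_extrapolation((1.5, 0.5), training_inputs)

print("surrounded case needs extrapolation:  ", surrounded)
print("case inside the hole:                 ", in_hole)
print("case outside the training range:      ", out_of_range)
```

A check like this only measures coverage of the input space. It says
nothing about whether the function is smooth enough for nearby training
cases to be informative.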
If you have more information about the function, e.g. that the outputs
should be linearly related to the inputs, you can often take advantage of
this information by placing constraints on the network or by fitting a more
specific model, such as a linear model, to improve generalization.
Extrapolation is much more reliable in linear models than in flexible
nonlinear models, although still not nearly as safe as interpolation. You
can also use such information to choose the training cases more efficiently.
For example, with a linear model, you should choose training cases at the
outer limits of the
...