Okay, I know I need to get out more but ...
I've just spent an enjoyable and educational Sunday afternoon catching
up on a highly focussed but (to me) fascinating fragment of
linguistics / paleographics.
As is so often the case, the instigation was a Language Log posting,
this time by Mark Liberman on "Conditional entropy and the Indus
script" [1].
The posting interested me for a couple of reasons that I suspect might
also be relevant to this SIG.
(I hope you'll forgive me for not attempting to summarise the actual
work because to do so would be to risk trivialising it; I can only
suggest that you read the LL post.)
In the LL posting, Mark presents an excerpt from Rao et al's recently
published refutation (of Farmer et al's argued refutation of the
"Indus-script" thesis) in which Rao et al use conditional entropy [2]
to support a counter-claim that the Indus symbols do demonstrate
linguistic structure.
The excerpt of Farmer et al's response included by Mark Liberman ends
thus:
"If the paper had been properly peer reviewed it would not have been
published."
As Liberman notes, "Strong words."
I was unable to resist Mark's invitation to read the papers and
attempt to gain for myself an understanding of the concepts involved
(regrettably the Lawler paper referenced is inaccessible behind a
paywall but Farmer et al's original refutation paper is readily
available [3] and some digging produced the Rao et al paper and the
corresponding Supporting Online Material [4]).
Of possible interest to this group is Mark's declaration of intent to
return to the topic later --- "to de-mystify the concept of
"conditional entropy", and show you how to replicate such experiments
yourself, if you care to".
For my own part, I look forward to the de-mystification and, as I
would care to indulge in a little replication, I have prepared the
ground by installing the University of Edinburgh's "Maximum Entropy
Modeling Toolkit for Python and C++" [5] on my MacBook Pro (I decided
to add the optional gcc fortran library [6] to take advantage of the
extra speed offered, on the basis that it will probably be needed).
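Independently of any toolkit, the basic quantity can be estimated in a
few lines. Conditional entropy, H(Y|X) = -sum p(x,y) log2 p(y|x),
measures in bits how uncertain the next sign is once the current sign
is known. What follows is a minimal Python sketch of the plain
maximum-likelihood estimate from adjacent sign pairs. It is not Rao et
al's procedure (nothing is assumed here about their smoothing or block
sizes), and the function name and toy texts are made up purely for
illustration.

```python
import math
from collections import Counter


def conditional_entropy(sequences):
    """Estimate H(next sign | current sign) in bits from adjacent pairs.

    Plain maximum-likelihood estimate with no smoothing; pairs are never
    formed across the boundary between two sequences.
    """
    pair_counts = Counter()     # counts of (current, following)
    current_counts = Counter()  # counts of current, as first of a pair
    for seq in sequences:
        for current, following in zip(seq, seq[1:]):
            pair_counts[(current, following)] += 1
            current_counts[current] += 1

    total = sum(pair_counts.values())
    if total == 0:
        raise ValueError("need at least one sequence of two or more signs")

    entropy = 0.0
    for (current, _following), count in pair_counts.items():
        p_pair = count / total                          # p(x, y)
        p_following = count / current_counts[current]   # p(y | x)
        entropy -= p_pair * math.log2(p_following)
    return entropy


# Hypothetical toy "inscriptions", one list of signs per inscription.
toy_texts = [list("ABAB"), list("ABAC")]
print(conditional_entropy(toy_texts))
```

A value of zero means each sign fully determines its successor; the
ceiling is log2 of the number of distinct signs. With samples as small
as most inscription corpora the unsmoothed estimate is biased low, so
treat the output as indicative only.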
The artificial data sets constructed by Rao et al. have been
criticized as being "meaningless". There are at least a couple of
workers in NL on this list and I'd be grateful for some pointers to
the construction of more plausible/useful datasets, if time and
inclination permit.
The other point of potential relevance to this group is more by way of a
caveat. Rao et al's choice of point of attack (the degree of rigour of
sequential structuring) seems rather ill-supported.
"Here we compare the statistical structure of sequences of signs in
the Indus script with those from a representative group of linguistic
and nonlinguistic systems.
Two major types of nonlinguistic systems are those that do not exhibit
much sequential structure (“Type 1” systems) and those that follow
rigid sequential order (“Type 2” systems). For example, the
sequential order of signs in Vinča inscriptions appears to have been
unimportant (4). On the other hand, the sequences of deity signs in
Near Eastern inscriptions found on boundary stones (kudurrus)
typically follow a rigid order that is thought to reflect the
hierarchical ordering of the deities (5).
Linguistic systems tend to fall somewhere between these two extremes."
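The two extremes described in the passage above can be made concrete
with toy data. This is a minimal Python sketch under assumptions of my
own, not a reconstruction of Rao et al's data sets: 16 made-up signs,
2000 sequences of 6 signs each, "Type 1" modelled as independent
uniform draws and "Type 2" as a fixed cyclic order. All function names
are hypothetical.

```python
import math
import random
from collections import Counter


def conditional_entropy(sequences):
    """Maximum-likelihood H(next sign | current sign) in bits."""
    pair_counts = Counter()
    current_counts = Counter()
    for seq in sequences:
        for current, following in zip(seq, seq[1:]):
            pair_counts[(current, following)] += 1
            current_counts[current] += 1
    total = sum(pair_counts.values())
    entropy = 0.0
    for (current, _following), count in pair_counts.items():
        entropy -= (count / total) * math.log2(count / current_counts[current])
    return entropy


def type1_sequences(signs, n_sequences, length, rng):
    """No sequential structure: each sign drawn independently, uniformly."""
    return [[rng.choice(signs) for _ in range(length)]
            for _ in range(n_sequences)]


def type2_sequences(signs, n_sequences, length, rng):
    """Rigid order: random start, then always the next sign in a fixed cycle."""
    sequences = []
    for _ in range(n_sequences):
        start = rng.randrange(len(signs))
        sequences.append([signs[(start + i) % len(signs)]
                          for i in range(length)])
    return sequences


rng = random.Random(0)  # fixed seed so that runs are repeatable
signs = [f"s{i}" for i in range(16)]

h_type1 = conditional_entropy(type1_sequences(signs, 2000, 6, rng))
h_type2 = conditional_entropy(type2_sequences(signs, 2000, 6, rng))

print(f"Type 1 (no order):    {h_type1:.2f} bits")
print(f"Type 2 (rigid order): {h_type2:.2f} bits")
print(f"Ceiling for {len(signs)} signs: {math.log2(len(signs)):.2f} bits")
```

The rigid system comes out at zero and the unordered one close to the
ceiling, which is the sense in which the two types bound the scale.
Whether such constructions are representative of any real
nonlinguistic system is exactly the point at issue.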
The references used by Rao et al are (4) S. M. M. Winn, in The Life of
Symbols and (5) J. A. Black and A. Green, Gods, Demons and Symbols of
Ancient Mesopotamia.
Black and Green's book [7] is described as "an illustrated dictionary"
and, as far as I can ascertain from an examination of Amazon's "look
inside this book", it does not contain the usual raft of supporting
references that one would wish to see. The Shan M. M. Winn reference
is reportedly difficult to obtain but there is an online update [8]:
Shan M. M. Winn:
"it is apparent that many readers interested in the script have no
access to the 1990 publication of "A Neolithic Sign System in
Southeastern Europe"; therefore, relevant portions from that article
have been introduced at demarcated points in the following article."
"Sign groups occur principally on spindle whorls and to a lesser
extent on pottery, but a small number of tablet-like objects and
figurines are marked with groups of signs. Sign groups on pottery
usually consist of only two signs, though there is ample space for
more; in contrast, numerous signs are incised on whorls, despite the
limited available space.
Neither the order nor the direction of the signs in these groups is
readily determinable; moreover, judging by the frequent lack of
arrangement, precision in the order probably was unimportant."
I'd find it hard to argue that either reference offers sufficiently
solid support for Rao et al's adoption of them as boundaries for the
degree of rigour of sequential structure. Neither do they seem
terribly well-supported as members of a "representative group of
nonlinguistic systems". The "Indus-script thesis" is an important
issue both academically and politically [9]; workers in the field have
a responsibility to ensure that the underpinning assumptions of their
work are properly grounded and supported.
Even allowing that conditional entropy /is/ a valid stochastic
analytical technique to use in this specific linguistic/symbolic
context, Wired [10] is really pushing the bounds of credulity with "An
ancient script that's defied generations of archaeologists has yielded
some of its secrets to artificially intelligent computers." Over-eager
hyperbole doesn't further AI; it merely misleads people as to the true
state of affairs. New Scientist [11] is more appropriately
circumspect, terming the approach "a new mathematical analysis".
I'm also a little skeptical of Wired's apparent quoting of Rao: "It's
only recently that archaeologists have started to apply computational
approaches in a rigid manner." Should that be "rigorous" perhaps?
Even so, Sproat and Koskenniemi, workers on both sides of the
argument, seem to concur that stochastic techniques aimed at frequency
and recurrence analysis are of dubious relevance: "Plain statistical
tests such as the distribution of sign frequencies and plain
reoccurrencies can (a) neither prove that the signs represent writing,
(b) nor prove that the signs do not represent writing. Falsifying
being equally impossible as proving." [11]
(Mind you, I'm not sure what these domain experts mean by "plain
statistical tests" - is there an independent scale somewhere that I
can consult?)
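One possible reading of "plain", and it is only an assumption, is a
test that looks at how often each sign occurs and nothing else. Such
counts are blind to order by construction: shuffling the signs within
every inscription leaves them untouched while changing which signs
sit next to each other. A minimal Python sketch with made-up data and
hypothetical function names:

```python
import random
from collections import Counter


def sign_frequencies(sequences):
    """Count how often each sign occurs, ignoring position entirely."""
    return Counter(sign for seq in sequences for sign in seq)


def adjacent_pairs(sequences):
    """Count (sign, next sign) pairs, which do depend on order."""
    return Counter(pair for seq in sequences for pair in zip(seq, seq[1:]))


rng = random.Random(1)  # fixed seed so that runs are repeatable

# Made-up corpus: 50 inscriptions that all follow the same rigid order.
ordered = [list("ABCD") for _ in range(50)]
# The same inscriptions with the signs shuffled within each one.
shuffled = [rng.sample(seq, len(seq)) for seq in ordered]

same_frequencies = sign_frequencies(ordered) == sign_frequencies(shuffled)
print("Identical sign frequencies:", same_frequencies)
print("Distinct adjacent pairs, ordered: ", len(adjacent_pairs(ordered)))
print("Distinct adjacent pairs, shuffled:", len(adjacent_pairs(shuffled)))
```

Measures built on adjacent pairs, conditional entropy among them, can
tell the two corpora apart; frequency counts alone cannot.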
I look forward to learning more from Mark Liberman's de-mystification
of the general technique.
In the course of the browsing required to construct this post, I came
across the list of accepted papers for this year's ACL-IJCNLP [12],
which I think makes fascinating reading in itself - so much very
high-grade effort being expended on reverse-engineering what is in
effect an emergently-generated, organic communications protocol. I'm
still in awe of the power, variety and complexity of natural language
(spoken and written) and of the fact that it was developed ad hoc by
comparatively unsophisticated peoples equipped "only" with their
native wit. For my money, AI has a looong way to go yet before it even
begins to look plausible.
Cheers,
Graham (in part, trying to ensure that Noah has something to blog about)
[1] http://languagelog.ldc.upenn.edu/nll/?p=1374
[2] http://en.wikipedia.org/wiki/Conditional_entropy
[3] http://www.safarmer.com/fsw2.pdf
[4] http://www.cs.washington.edu/homes/rao/ScienceIndus.pdf
[5] http://homepages.inf.ed.ac.uk/s0450736/maxent_toolkit.html
[6] http://r.research.att.com/tools/
[7] http://www.amazon.com/Gods-Demons-Symbols-Ancient-Mesopotamia/dp/0292707940
[8] http://www.prehistory.it/ftp/winn.htm
[9] http://www.flonnet.com/fl1801/18010730.htm
[10] http://blog.wired.com/wiredscience/2009/04/indusscript.html
[11] http://www.newscientist.com/article/dn17012-scholars-at-odds-over-mysterious-indus-script.html
[12] http://www.acl-ijcnlp-2009.org/main/acceptedfullpapers.html
Though I have only absorbed a little of Graham's readings, I wanted to
respond to his conclusion and make a point about AI and this group.
Graham:
> ... much high-grade effort [is] being expended on the reverse-
> engineering what is in effect an emergently-generated, organic
> communications protocol. I'm still in awe of the power, variety and
> complexity of natural language (spoken and written) and of the fact
> that it was developed ad hoc by comparatively unsophisticated
> peoples equipped "only" with their native wit. For my money, AI has
> a looong way to go yet before it even begins to look plausible.
We generally speak of intelligence as diffuse, undefined and abstract,
but we can ground the discussion in one specific intelligence - the
human capacity for language. Any artificial intelligence is based on
natural language, so my criterion for a project or a group to be
called AI is that, alongside the practical questions, it ask the big
questions: How is this artifice like nature's language? How will it
move us on the path to human-like intelligence? What does it mean to
be human?
I take a broad view of language:
- Language is inseparable from the human ability to make and
manipulate things. All of culture and technology has been accomplished
through language.
- Language encompasses our whole evolutionary history. Language is the
biological adaptation unique to humans. It grew out of, and can only
be understood in terms of, our biological substrate - the earlier
intelligence.
In a profound way, humans invented themselves by tinkering with
language. Our efforts in AI are a continuation of this invention.
[Aside] A classic I'd like to re-read:
The Sciences of the Artificial by Herbert Simon
http://books.google.com/books?id=k5Sr0nFw7psC
[Next post, Python...]
Rick