Cuneiform is the oldest known writing system. It was used in Mesopotamia (modern-day Iraq) for over 3,000 years to write Sumerian, Akkadian, and other languages. Written on clay, it has survived the millennia and is now being translated by scholars around the world.
In 1857, a newly found cylinder inscribed with cuneiform text and the name Tiglath-Pileser was discovered (dated to around 1150 BC). At this time, cuneiform was only just being relearned, and there was an open question of how reliable the various translation methods were.
The Royal Asiatic Society decided to perform an experiment (later published as the book Inscription of Tiglath Pileser I., King of Assyria). They would give the same inscription to four different translators and see how well they did. The idea was that if the translations were similar, then the current understanding of cuneiform was also good.
Ashur, great lord, who makes the totality of the gods submit, who gives scepter and crown, who establishes kingship; Enlil, lord, king of all the Anunnaku gods, father of the gods, lord of the lands; Sin, wise one, lord of the crown, saqu-worthy; Magur, Shamash, judge of heaven and underworld, who carries out the slander of the enemy, who breaks up the foe; Adad, hero, conqueror of the four quarters of the lands, the four quarters; Ninurta, hero, villainous sacrificial spirit and enemy, who destroys the heart of the people; Ishtar, foremost among the gods, lady of battle;
The great gods, who make the heavens and earth a lordly place, whose utterances are a scepter and a scepter, who make kingship supreme, Tiglath-Pileser, beloved prince, your beloved, your shepherd, who by your true heart you have entrusted to me, this exalted one, you have established for the sovereignty of the land of the great Enlil, you have granted him a scepter.
Existing online repositories (CDLI, Oracc) contain many transliterations of ancient cuneiform texts (a transliteration is a rewriting of a text from one writing system to another without changing the language), but they are sorely lacking in the translation department.
While I am not a cuneiform expert, I am an expert at neural networks and have a deep passion for languages and writing systems. I want any person to have access to the archives of the ancients. A grandiose goal for sure, but also a very achievable one thanks to modern engineering advancements.
Consider Sumerian (spoken by the creators of cuneiform). There are currently 103,075 texts published with transliterations from cuneiform symbols to (mostly) Latin letters. But only 4,583 of these texts have publicly available translations online. That is a mere 4% of texts available to a lay person such as myself.
Ignoring the absurdly large LLMs that are dominating the field now (GPT-4 and friends), the humble smaller transformers are still quite powerful and have made the problem of translation somewhat trivial.
My favorite one of these is the T5 network from Google. While large itself, it is capable of being trained using off-the-shelf (though expensive) GPUs. If you can build a large training set, you can train this network at home to accomplish wonders.
As any machine learning expert will tell you, 90% of the problem is collecting a good training dataset (the other 10% is justifying the compute bill). Building the cuneiform dataset presented its own unique set of challenges.
Sadly, Assyriologists took some time to settle on a consistent transliteration system. When works were first transliterated to a digital form, only ASCII characters were available, and the researchers made do using funny characters like # to denote determinatives, numbers to disambiguate symbols, and ALL CAPS whenever they were in the mood (just kidding, but the usage is so random it might as well be).
When other character encodings became available, researchers adapted. They started to use diacritic marks to disambiguate symbols (loosely based on guessed sounds). And then HTML was invented and they went wild with special marks attempting to better capture the original writing.
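To give a feel for the kind of cleanup this history forces on anyone building a dataset, here is a minimal sketch of normalizing a few legacy ASCII habits into Unicode. The mapping covers only a handful of common conventions and is my own illustrative assumption, not a complete specification of any transliteration standard (the lowercasing step in particular is a simplification; in real corpora, case can carry meaning):

```python
import re

# A few common ASCII-era conventions, mapped to their Unicode forms.
# This table is an illustrative assumption, not an exhaustive list.
ASCII_TO_UNICODE = {
    "sz": "š",   # "sz" stood in for s with caron
    "s,": "ṣ",   # "s," stood in for s with a dot below
    "t,": "ṭ",
}

def normalize(text: str) -> str:
    """Rewrite a transliterated line into a more uniform Unicode form."""
    out = text.lower()  # simplification: real corpora use case meaningfully
    for ascii_form, uni in ASCII_TO_UNICODE.items():
        out = out.replace(ascii_form, uni)
    # Turn trailing sign-index digits (du11) into subscripts (du₁₁).
    subs = str.maketrans("0123456789", "₀₁₂₃₄₅₆₇₈₉")
    out = re.sub(r"(?<=[a-zš])(\d+)",
                 lambda m: m.group(1).translate(subs), out)
    return out

print(normalize("DU11 szu"))
```

Every corpus era needs its own pass like this before the data can be merged into one training set.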
While starting with a pre-trained model saves me a lot of compute time, it has drawbacks. The pre-trained model was trained to translate from English to French or German. Ideally, I would have a model that was pre-trained to translate to English.
I found a regularization strategy that helped a lot: I also trained the model to translate in the reverse direction, from English to Sumerian and Akkadian. Doing this helped the network to always converge. I assume this is an effect of using the pre-trained network.
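The reverse-direction trick above amounts to doubling the dataset with flipped pairs. Here is a sketch of how such examples might be built; the prefix wording follows T5's "translate X to Y:" convention, and the sample Sumerian pair is a hypothetical placeholder of my own:

```python
# Build prompt/target examples in both directions. The reverse
# direction acts as the regularizer described above.
def make_examples(src_lang, tgt_lang, pairs):
    examples = []
    for src, tgt in pairs:
        examples.append((f"translate {src_lang} to {tgt_lang}: {src}", tgt))
        examples.append((f"translate {tgt_lang} to {src_lang}: {tgt}", src))
    return examples

# Hypothetical transliteration/translation pair for illustration.
pairs = [("lugal-e e2 mu-du3", "the king built the temple")]
for prompt, target in make_examples("Sumerian", "English", pairs):
    print(prompt, "->", target)
```

Because T5 routes tasks through these text prefixes, no architectural change is needed to train both directions in one model.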
Side note: If you are an academic and would like to collaborate on this project, please reach out to me by filing issues on GitHub. I have a million questions about cuneiform that I would love to ask you.
The Akkadian language is one of the earliest known Semitic languages, a family that includes modern languages such as Arabic and Hebrew. It was spoken in ancient Mesopotamia, primarily in the Akkadian Empire that was situated in the region that is today parts of Iraq and northeastern Syria. Akkadian is named after the ancient city of Akkad, one of the major centers of the Akkadian civilization.
Akkadian was used for a wide range of purposes, from administrative and legal documents to literature and science texts. It was written using cuneiform script on clay tablets, and its decipherment in the 19th century opened up a new window into the ancient world, providing scholars with valuable insights into the history, culture, and scientific achievements of the time.
The full decipherment of cuneiform took over 200 years, from 1802 to 2022. The story starts with the so-called Behistun Inscription. Discovered in Iran and dating back to the time of King Darius I of Persia (550 BC), this multilingual inscription included three types of script: Old Persian, Elamite, and Akkadian cuneiform. Old Persian was deciphered first, providing clues for the other two.
This is a follow-up to a previous study by Gordin and colleagues that also looked at how AI can be used to translate cuneiform. This time, two versions of the model were trained. The first one translates the Akkadian from cuneiform representations into Latin script (a rendering called a transliteration). The other version translates from Unicode representations of cuneiform signs (which is how cuneiform is often digitized).
In this case, the AI did a great job of translating most of the content. However, an error that likely occurred when cleaning the data for training caused the AI to miss the negation, completely altering the meaning of the sentence.
In the majority of cases, however, the translation was very useful as a first pass over the text. The researchers say the AI can be used by scholars, or even by students who want to study this language in more detail.
I have written previously (here and here) about my interest in the Ur III period, a Sumerian dynasty characterized by an abundance of administrative documents in the form of clay tablets written in cuneiform script. We know of about 65,000 such documents, which record various types of transactions. In addition, thanks to initiatives such as the CDLI (Cuneiform Digital Library Initiative) and ORACC (Open Richly Annotated Cuneiform Corpus), these texts (and many others) are available in digital form and may be exploited using modern data analysis techniques.
In any supervised machine translation learning task, one needs a dataset of sequences in the source language and their translations in the target language (the ground truth). Thankfully, the CDLI has not only digitized and transliterated Sumerian texts, but in some cases an English translation is also provided. In addition, the entire CDLI text data is dumped daily to this GitHub repository and can be parsed fairly easily.
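Extracting aligned pairs from such a dump can be sketched as follows. The sample text is invented, and real CDLI ATF files contain many more line types (damage annotations, structure markers, other translation languages), so this is an illustration of the idea rather than a complete parser:

```python
import re

# Invented ATF-style sample: numbered transliteration lines, each
# optionally followed by a "#tr.en:" English translation line.
SAMPLE_ATF = """\
&P100001 = example tablet
@tablet
@obverse
1. lugal-e e2 mu-du3
#tr.en: the king built the temple
2. mu-na-du3
#tr.en: he built it for him
"""

def extract_pairs(atf: str):
    """Collect (transliteration, translation) pairs from ATF text."""
    pairs, last_line = [], None
    for line in atf.splitlines():
        if re.match(r"^\d+\.\s", line):      # a transliterated text line
            last_line = line.split(" ", 1)[1]
        elif line.startswith("#tr.en:") and last_line is not None:
            pairs.append((last_line, line[len("#tr.en:"):].strip()))
            last_line = None
    return pairs

print(extract_pairs(SAMPLE_ATF))
```

Lines with no following `#tr.en:` simply never produce a pair, which is exactly the untranslated majority of the corpus.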
I decided to train the neural translation model at the character level, not at the word level. This is not the usual choice in the neural translation papers I have seen. One of the reasons is that character-level sequences become extremely long, and deep learning models using Recurrent Neural Networks (RNNs) have difficulties with long-range dependencies (even with cell structures such as LSTMs or GRUs). However, in the case of the Sumerian corpus, the sentences are short and do not exceed 40-60 characters on average, which is acceptable for LSTMs/GRUs. One could also wonder whether a word-level model is well suited to transliterated sequences coming from a logosyllabic script like cuneiform, though I have no definite answer to this.
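A character-level setup only needs a tiny vocabulary built from the corpus itself. Here is a minimal sketch; the special tokens and id assignments are my own choices for illustration:

```python
def build_char_vocab(corpus):
    """Map every character seen in the corpus to an integer id."""
    chars = sorted(set("".join(corpus)))
    vocab = {"<pad>": 0, "<eos>": 1}   # reserved special tokens
    vocab.update({c: i + 2 for i, c in enumerate(chars)})
    return vocab

def encode(text, vocab):
    """Turn a string into a list of ids, terminated by <eos>."""
    return [vocab[c] for c in text] + [vocab["<eos>"]]

corpus = ["lugal-e e2 mu-du3"]   # hypothetical transliterated line
vocab = build_char_vocab(corpus)
print(encode("lugal", vocab))
```

A character vocabulary like this stays in the dozens of symbols, versus tens of thousands for a word-level one, which also sidesteps the out-of-vocabulary problem for rare sign readings.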
Both models are based on an encoder/decoder structure. In the case of Seq2Seq, the input sequence is fed character by character to an encoder network of stacked LSTM (or GRU) layers, which encodes the sequence into a fixed-size vector. This vector is used by the decoder network (also made of stacked LSTM layers) to predict the output sequence character by character in an auto-regressive fashion. This model was a turning point in neural translation but is nowadays considered quite limited, since it is difficult to imagine that one can encode the whole meaning of a sentence into a single vector. Later models added attention mechanisms, i.e. ways to weigh the importance of a word and its neighbors in the translation. Google's Transformer model gets rid of the recurrent nature of the network to keep only the attention mechanism, which in their case improves both the training speed and the translation quality.
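The encoder/decoder information flow can be sketched with a toy numpy model. Simple non-gated RNN cells stand in for the stacked LSTMs/GRUs, the weights are random, and the sizes are made up, so the "translation" it produces is meaningless; the point is only to show the fixed-size bottleneck and the auto-regressive decoding loop:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, hidden = 30, 16                     # hypothetical sizes

# Shared toy weights for both encoder and decoder cells.
W_xh = rng.normal(0, 0.1, (vocab, hidden))
W_hh = rng.normal(0, 0.1, (hidden, hidden))
W_hy = rng.normal(0, 0.1, (hidden, vocab))

def run_encoder(token_ids):
    """Fold the whole input sequence into one fixed-size state vector."""
    h = np.zeros(hidden)
    for t in token_ids:
        x = np.eye(vocab)[t]               # one-hot character
        h = np.tanh(x @ W_xh + h @ W_hh)
    return h

def run_decoder(h, max_len=5):
    """Greedily unroll the output, feeding each prediction back in."""
    out, t = [], 0                         # 0 doubles as a start token here
    for _ in range(max_len):
        x = np.eye(vocab)[t]
        h = np.tanh(x @ W_xh + h @ W_hh)
        t = int(np.argmax(h @ W_hy))       # most likely next character
        out.append(t)
    return out

state = run_encoder([3, 7, 2, 9])
print(state.shape, run_decoder(state))
```

The single `state` vector is exactly the bottleneck the paragraph above criticizes: every nuance of the input sentence must squeeze through those few hidden dimensions, which is what attention mechanisms later relaxed.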
Not much, really. As I said above, this was simply an experiment for the sake of fun, so I did not have very high hopes, especially given the hardware limitations. In addition, there are other limitations coming from the data itself:
And now for the results. After training the model, I generated translations for unseen tablets from the CDLI corpus where no English translation was present. We begin with some good cases. Here is CDLI no. P131719:
As seen above, the Seq2Seq model encodes any sentence into a series of fixed-size vectors, the encoder states, which are then used by the decoder to produce the translation. We can study these representations to see what happens after training. I used UMAP to perform dimensionality reduction and project the 43,552 vectors of size 1024 of the training set into two dimensions. The scatter plot below shows each one of these projections.