Well said, Sir. In olden days, individuals did not have access to the data, compute, know-how, and free time to start projects on their own. Now, thanks to open source and to
improvements in each of these, it is conceivable.
All great work begins with a committed individual's relentless efforts. When such individuals team up without ego, this gets magnified immensely.
Until 2015, state-of-the-art systems for pattern recognition (image recognition, speech recognition, translation, speech synthesis, OCR) relied on an enormous amount of human-engineered features
and human logic programmed into the systems. Deep learning systems are trained end to end, from raw data to expected output. They learn not only the classification but also the appropriate feature engineering.
This may seem counterintuitive in one sense (how can a system learn features?) and discouraging for experts who have spent decades building such systems (has all my work gone to waste?),
but it is mind-boggling and it works.
As an example, in neural OCR you give an entire line of scanned text and its corresponding transcribed Unicode text as a pair, and the system learns not only character recognition
but also font recognition, character segmentation, and word segmentation. Google's Tesseract OCR is an available full system that we can train, with these pipelines already coded.
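To make that concrete, here is a minimal sketch of invoking a trained Tesseract model on a single line image through the pytesseract wrapper; the image file name is a placeholder, and this assumes the Tamil traineddata is installed:

    # minimal sketch: run a trained Tesseract model on one scanned line image
    # assumes pytesseract and the Tamil traineddata are installed; file name is hypothetical
    from PIL import Image
    import pytesseract

    line_image = Image.open("scanned_line.png")                # one line of scanned Tamil text
    text = pytesseract.image_to_string(line_image, lang="tam")
    print(text)                                                # recognized Unicode transcription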
Another example: neural speech synthesis (the state of the art is Tacotron, from Google) can take pairs of typed text and corresponding utterances and learn an end-to-end network
that internally learns all the required modules (parsing the text, identifying characters or letters as needed, syllable mapping, acoustic translation, adjusting the synthesized speech for smoothness
and believability). Human-engineered systems needed the design of each module, then tying them together and fine-tuning each.
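To illustrate the kind of training data such a system needs, here is a hedged sketch of assembling (text, utterance) pairs; the LJSpeech-style metadata layout and the file paths are assumptions for illustration, not part of Tacotron itself:

    # minimal sketch: the (typed text, utterance) pairs an end-to-end TTS model trains on
    # assumes an LJSpeech-style metadata.csv with "file_id|transcript" lines; paths are hypothetical
    import csv
    import torchaudio

    pairs = []
    with open("metadata.csv", encoding="utf-8") as f:
        for file_id, transcript in csv.reader(f, delimiter="|"):
            waveform, sample_rate = torchaudio.load(f"wavs/{file_id}.wav")
            pairs.append((transcript, waveform))   # the network maps text -> audio end to end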
Traditional translation techniques required human coding of dictionaries, grammar, translation rules, and exceptions. The state-of-the-art seq2seq with attention, or the Transformer,
does not need any such parts; it is end to end, and it automatically learns vocabulary, meanings, grammar, rules, and exceptions by adjusting its own performance against a large parallel corpus.
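As a rough illustration of how little hand-coded structure such a model contains, here is a toy sketch using PyTorch's built-in Transformer; the vocabulary sizes and dimensions are arbitrary placeholders, and a real system would add tokenization, positional encodings, masking, and a training loop over a parallel corpus:

    # toy sketch of an end-to-end Transformer translation model: no dictionaries,
    # grammar, or rules are coded in; everything is learned from (source, target) pairs
    import torch
    import torch.nn as nn

    SRC_VOCAB, TGT_VOCAB, D_MODEL = 8000, 8000, 256   # placeholder sizes

    src_embed = nn.Embedding(SRC_VOCAB, D_MODEL)
    tgt_embed = nn.Embedding(TGT_VOCAB, D_MODEL)
    transformer = nn.Transformer(d_model=D_MODEL, nhead=8,
                                 num_encoder_layers=3, num_decoder_layers=3,
                                 batch_first=True)
    generator = nn.Linear(D_MODEL, TGT_VOCAB)          # projects to target-vocabulary logits

    src = torch.randint(0, SRC_VOCAB, (1, 12))         # a source sentence as token ids
    tgt = torch.randint(0, TGT_VOCAB, (1, 10))         # the target sentence so far
    out = transformer(src_embed(src), tgt_embed(tgt))
    logits = generator(out)                            # next-token predictions per position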
This is the best highlight. Totally agree. Only by trying to build from scratch do we begin to understand anything.
Rebuilding not only improves systems; it improves each of our abilities and gives us ideas for new applications.
In modern ML it is easy to get discouraged by the complexity and level of performance of modern systems, but one should always
try to start from scratch and code upwards.
However, at some point AFTER trying oneself, one knows how to incorporate other people's building blocks.
If one starts to build a translation system from scratch, one may not get very far, but only that experience gives the perspective
and appreciation of everything involved, and the courage to use other tools within our own outlines and our own agenda, without being
totally dependent or carried away.
I would like to give three examples:
1. Vanangamudi's crawlers are so simple, straightforward, and self-contained because they are coded from scratch.
2. The word2vec browsing app that the team has put together is very, very creative (click on a word and see its friends and enemies); a minimal sketch of the underlying lookup follows after this list.
3. TShrinivasan has a wikibook OCR module that cleverly takes in a PDF, splits it into pages, splits the pages into columns, sends them to Google Docs, gets the Tamil OCR output,
and updates that text on the wiki! This is a very bottom-up application done well. Of course one can go ahead and make it functional, modular, etc.,
but I was amazed by these innovations.
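On point 2 above, here is a minimal sketch of the kind of lookup such a browser can do with gensim; the vector file name and the query word are hypothetical placeholders, and "enemies" is interpreted here simply as the far end of the similarity ranking:

    # minimal sketch of the word2vec lookup behind a "friends and enemies" browser
    # the pre-trained Tamil vector file and the query word are hypothetical
    from gensim.models import KeyedVectors

    vectors = KeyedVectors.load_word2vec_format("tamil_word2vec.bin", binary=True)
    friends = vectors.most_similar("நீர்", topn=10)               # nearest neighbours ("friends")
    enemies = vectors.most_similar(negative=["நீர்"], topn=10)    # most dissimilar ("enemies")
    print(friends)
    print(enemies)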
Update from myself:
1. I was able to write a wikidump processor to extract all Tamil Wikipedia articles.
2. Created a SentencePiece model for automatic Tamil tokenization (a training sketch follows after this list).
3. Built a ULMFiT Tamil language model with a lowest perplexity of 37.
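For step 2 above, here is a minimal training sketch with the SentencePiece Python API; the file names, vocabulary size, and sample phrase are placeholders, not the exact settings I used:

    # minimal sketch: train a SentencePiece tokenizer on the extracted Tamil wiki text
    # assumes the wikidump processor wrote one sentence per line; file names are placeholders
    import sentencepiece as spm

    spm.SentencePieceTrainer.train(
        input="tawiki_sentences.txt",    # plain-text output of the wikidump extractor
        model_prefix="tam_sp",           # writes tam_sp.model and tam_sp.vocab
        vocab_size=8000,
        character_coverage=1.0,          # keep the full Tamil character set
        model_type="unigram",
    )

    sp = spm.SentencePieceProcessor(model_file="tam_sp.model")
    print(sp.encode("தமிழ் மொழி", out_type=str))   # subword pieces for a sample phrase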
I will try to make some time this weekend to create two versions of this on my git
1. End to end single notebook that can be used for any language.
2. Three modules: wikita-extractor, tam-sp-tokenizer, tam-lm
But whoever else wants to do ULMFiT, please continue; we might get alternative approaches, but we
can start with one baseline.
I also have a few notebooks on Tamil OCR. I will try to organize them, make them useful and usable, and publish them
in the next few weeks.
Thanks
Ravi