Use an online tokenizer playground (like
https://tokenizer.model.box/) to understand how LLMs convert text/images/audio into a series of numbers (tokens), which they then process purely through mathematical computation.
---
Here is an example of the gpt-4o-mini tokenizer from that site:
An empty system message and the user message बालक! बालकः पठति। बालकाः पठन्ति। बालिका अपि पठति। अहं अपि पठामि। are together converted into this chat syntax (the format the LLM is actually trained on):
<|im_start|>system<|im_sep|><|im_end|><|im_start|>user<|im_sep|>बालक! बालकः पठति। बालकाः पठन्ति। बालिका अपि पठति। अहं अपि पठामि।<|im_end|><|im_start|>assistant<|im_sep|>
This is then converted into the following series of numbers (tokens):
200264, 17360, 200266, 200265, 200264, 1428, 200266, 191301, 1016, 0, 64081, 1016, 41582, 164888, 16199, 1670, 64081, 1016, 721, 225, 164888, 33843, 785, 1670, 64081, 27679, 9250, 785, 164888, 16199, 1670, 52807, 1004, 9250, 785, 164888, 5195, 785, 1670, 200265, 200264, 173781, 200266
For the LLM:
- बाल = 191301
- बाल prefixed with space = 64081
- पठ prefixed with space = 164888
- क as a suffix = 1016
- ः (visarga) as a suffix = 41582
And so on. (The sketch below shows how to inspect this mapping yourself.)
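If you want to reproduce this locally instead of through the playground, here is a minimal sketch assuming the `tiktoken` package is installed (tiktoken maps gpt-4o-mini to the o200k_base encoding). Note that it encodes only the message text, not the chat-format special tokens the playground inserts, so the surrounding ids like 200264 will not appear:

```python
# Minimal sketch, assuming the `tiktoken` package is installed.
import tiktoken

# gpt-4o-mini resolves to the o200k_base encoding in tiktoken.
enc = tiktoken.encoding_for_model("gpt-4o-mini")

text = "बालक! बालकः पठति। बालकाः पठन्ति। बालिका अपि पठति। अहं अपि पठामि।"

# Encode the plain user text (no <|im_start|> etc.) into token ids.
token_ids = enc.encode(text)
print(token_ids)

# Show which raw bytes each id stands for; some ids cover only part of a
# UTF-8 sequence, so they are shown as bytes rather than characters.
for tid in token_ids:
    print(tid, enc.decode_single_token_bytes(tid))
```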
---
Now take the exact same set of Sanskrit sentences in ISO-15919 romanization:
bālaka! bālakaḥ paṭhati. bālakāḥ paṭhanti. bālikā api paṭhati. ahaṁ api paṭhāmi
and wrap them in the same syntax:
<|im_start|>system<|im_sep|><|im_end|><|im_start|>user<|im_sep|>bālaka! bālakaḥ paṭhati. bālakāḥ paṭhanti. bālikā api paṭhati. ahaṁ api paṭhāmi<|im_end|><|im_start|>assistant<|im_sep|>
Now you get these tokens:
200264, 17360, 200266, 200265, 200264, 1428, 200266, 65, 118479, 3578, 0, 287, 118479, 3578, 59767, 2428, 106920, 71, 3009, 13, 287, 2485, 42836, 2485, 59767, 2428, 106920, 71, 9590, 13, 287, 2485, 6720, 2485, 11379, 2428, 106920, 71, 3009, 13, 29574, 114298, 11379, 2428, 106920, 71, 2485, 3900, 200265, 200264, 173781, 200266
Here, bālakaḥ = b prefixed with a space (287) + āl (118479) + aka (3578) + ḥ (59767)
But bālakāḥ = b prefixed with a space (287) + ā (2485) + lak (42836) + ā (2485) + ḥ (59767)
And bālikā = b prefixed with a space (287) + ā (2485) + lik (6720) + ā (2485)
The same letters split into quite different tokens depending on what follows them, and none of these pieces line up with the tokens the Devanagari version produced. (The sketch below lets you reproduce the comparison.)
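A small sketch, under the same `tiktoken` assumption as above, that puts the two scripts side by side:

```python
# Minimal sketch, assuming `tiktoken` is installed: compare how the same
# sentences tokenize in Devanagari vs ISO-15919 romanization.
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4o-mini")

devanagari = "बालक! बालकः पठति। बालकाः पठन्ति। बालिका अपि पठति। अहं अपि पठामि।"
romanized = "bālaka! bālakaḥ paṭhati. bālakāḥ paṭhanti. bālikā api paṭhati. ahaṁ api paṭhāmi"

for label, text in [("Devanagari", devanagari), ("ISO-15919", romanized)]:
    ids = enc.encode(text)
    print(label, "->", len(ids), "tokens")
    print([enc.decode_single_token_bytes(t) for t in ids])
```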
---
So, LLMs only understand these numbers and their relations to each other. An LLM is created by first tokenizing all the content it is to be trained on and then performing trillions upon trillions of computations on that input, which is eventually reduced to a set of weights (numbers, again) that encode, in a way that is opaque to us, the relations between all the tokens the model has seen.
When you "run" an LLM, a particular kind of software (a host/runner) loads these weights into memory, feeds your tokenized input through them, and performs computations based on specific algorithms to produce a stream of output tokens, which is then converted back into human-readable text.
This is what an LLM essentially is at a very basic level. And this is a result of the way digital computers operate.
So, no, you cannot skip tokenization and teach computers Sanskrit directly because they currently understand numbers and only numbers.
If people want to know more about computing and numbers, they can read Charles Petzold's Code: The Hidden Language of Computer Hardware and Software.