Contribution to AI

PRERNA SHUKLA

Sep 13, 2025, 11:58:16 PM
to sanskrit-p...@googlegroups.com
Hello team,

Not sure to whom I am writing this, but while chatting with Gemini about tokenization I was curious: what if that were not required for AI? It should be taught data in Sanskrit, since language is code itself; the machine doesn't have to convert it into numbers and can directly interpret the data we input in Sanskrit, and then it will be more sorted. Gemini suggested this group as welcoming.

I know Sanskrit, Hindi, English, and a few regional languages too.
I am a middleware administrator having knowledge of Unix/Linux commands.
Looking forward to contributing in any way to the growth of AI.

Thanks,
Prerna Shukla

विश्वासो वासुकिजः (Vishvas Vasuki)

Sep 14, 2025, 12:39:24 AM
to sanskrit-p...@googlegroups.com, prer...@gmail.com
namaste, preraNA - unless you subscribe to the mailing list, you will not get responses by email.

We don't contribute to the growth of AI here - we just use AI for Sanskrit purposes. If you are interested in Sanskrit, it might be a good idea to proofread Sanskrit texts so as to get a feel for what is useful.

--
Vishvas /विश्वासः

Karthik

Sep 14, 2025, 9:26:22 PM
to sanskrit-programmers
Digital computers run on electrical signals, which we map to bits and then to numbers at a higher level. These are just representations. Our current machines have been designed to operate very efficiently on numbers. So we map EVERYTHING to numbers: language, color, sound, temperature, pressure, etc.

Representation has no real meaning outside the mind. So Sanskrit exists only in our mind. On a page, Sanskrit is a set of symbols printed in a script you understand (currently Devanagari is popular). In a computer file, it is a sequence of bits representing Devanagari (or any other representation, like IAST).
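
You can see those numbers directly; a quick illustration in Python (standard library only):

word = "बालकः"
print([hex(ord(c)) for c in word])   # Unicode code points: ['0x92c', '0x93e', '0x932', '0x915', '0x903']
print(list(word.encode("utf-8")))    # the bytes a file on disk actually stores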

This is why you cannot teach LLMs Sanskrit (or any language) directly. The framework they currently operate on understands only numbers. So tokens will always be numbers, and you will have to tokenize your input.

Anunad Singh

Sep 15, 2025, 7:14:51 AM
to sanskrit-p...@googlegroups.com
To me, the original question seems to be written in machine language and its reply in an ultra-high-level language. Could not understand what was asked and what was said in reply.

- anunAda


vishal jaiswal

Sep 15, 2025, 10:43:50 AM
to sanskrit-p...@googlegroups.com
"Could not understand what was asked and what was said in reply."

Same here!

Karthik

Sep 16, 2025, 4:57:57 AM
to sanskrit-programmers
Use an online tokenizer playground (like https://tokenizer.model.box/) to understand how LLMs convert text/images/audio into a series of numbers, which they then process using mathematical computations.

---

Here is an example of the gpt-4o-mini tokenizer from that site:

The empty system message and the user message बालक! बालकः पठति। बालकाः पठन्ति। बालिका अपि पठति। अहं अपि पठामि। are together converted to this syntax (which is what the LLMs are trained on):

<|im_start|>system<|im_sep|><|im_end|><|im_start|>user<|im_sep|>बालक! बालकः पठति। बालकाः पठन्ति। बालिका अपि पठति। अहं अपि पठामि।<|im_end|><|im_start|>assistant<|im_sep|>

Which is then converted to this series of numbers/tokens:

200264, 17360, 200266, 200265, 200264, 1428, 200266, 191301, 1016, 0, 64081, 1016, 41582, 164888, 16199, 1670, 64081, 1016, 721, 225, 164888, 33843, 785, 1670, 64081, 27679, 9250, 785, 164888, 16199, 1670, 52807, 1004, 9250, 785, 164888, 5195, 785, 1670, 200265, 200264, 173781, 200266

For the LLM:

- बाल = 191301
- बाल prefixed with space = 64081
- पठ prefixed with space = 164888
- क as a suffix = 1016
- ः (visarga) as a suffix = 41582

And so on.
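
If you would rather reproduce this offline, the following Python sketch should print the same message-body ids, assuming the tiktoken package is installed and that gpt-4o-mini uses the o200k_base vocabulary; the <|im_start|> chat markers are added separately by the serving layer, so only the message body is encoded here:

import tiktoken  # pip install tiktoken

# assumption: gpt-4o-mini's vocabulary is o200k_base
enc = tiktoken.get_encoding("o200k_base")
text = "बालक! बालकः पठति। बालकाः पठन्ति। बालिका अपि पठति। अहं अपि पठामि।"
for token_id in enc.encode(text):
    # print each id next to the text fragment it stands for
    print(token_id, repr(enc.decode([token_id])))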

---

Now take the exact same set of Sanskrit sentences in ISO-15919 romanization:

bālaka! bālakaḥ paṭhati. bālakāḥ paṭhanti. bālikā api paṭhati. ahaṁ api paṭhāmi

and convert them to the syntax:

<|im_start|>system<|im_sep|><|im_end|><|im_start|>user<|im_sep|>bālaka! bālakaḥ paṭhati. bālakāḥ paṭhanti. bālikā api paṭhati. ahaṁ api paṭhāmi<|im_end|><|im_start|>assistant<|im_sep|>

Now you get these tokens:

200264, 17360, 200266, 200265, 200264, 1428, 200266, 65, 118479, 3578, 0, 287, 118479, 3578, 59767, 2428, 106920, 71, 3009, 13, 287, 2485, 42836, 2485, 59767, 2428, 106920, 71, 9590, 13, 287, 2485, 6720, 2485, 11379, 2428, 106920, 71, 3009, 13, 29574, 114298, 11379, 2428, 106920, 71, 2485, 3900, 200265, 200264, 173781, 200266

Here, bālakaḥ = b (287) + āl (118479) + aka (3578) + ḥ (59767)
But bālakāḥ = b (287) + ā (2485) + lak (42836) + ā (2485) + ḥ (59767)
And bālikā = b (287) + ā (2485) + lik (6720) + ā (2485)

---

So, LLMs only understand these numbers and their relations to each other. LLMs are created by first tokenizing the content they are to be trained on and then performing trillions and trillions of computations on that input, which is then reduced to a set of weights (numbers, again) that embed (in a way invisible to us) the relations between all the tokens they have seen.

When you "run" an LLM, a particular kind of software (a host/runner) loads these weights into memory, applies your tokenized input to it and performs computations based on specific algorithms to produce a stream of tokens which is then converted back to human-readable text.

This is what an LLM essentially is at a very basic level. And this is a result of the way digital computers operate.

So, no, you cannot skip tokenization and teach computers Sanskrit directly because they currently understand numbers and only numbers.

If people want to know more about computing and numbers, they can read Charles Petzold's Code: The Hidden Language of Computer Hardware and Software.

Anunad Singh

Sep 16, 2025, 5:33:46 AM
to sanskrit-p...@googlegroups.com
Karthik, what you are saying is perfectly correct. But it is a very general explanation of how ANY computation is done on a computer or on a digital device. A drawing file, an audio file, a video file -- everything is converted into numbers, and those numbers are finally converted to 1's and 0's (as it is said). And ultimately it is not even 1's and 0's, but 'logic levels' of voltages that are responsible for the final outcome.

In short, what you have said is in no way specific to, nor a distinguishing feature of, LLMs.

-- anunAda


Karthik

Sep 16, 2025, 5:44:23 AM
to sanskrit-programmers
True. That is why my original comment was about digital computing and the representation of values. But "this is why you cannot teach LLMs Sanskrit (or any language) directly" probably did not get the point across.

If you have a good idea about how modern computers operate at the signal level, the question about skipping tokens and teaching computers/LLMs Sanskrit directly does not even arise.

I thought I should expand upon what I said earlier so that there is no longer any doubt as to what I meant and why something like this cannot be done.

Anunad Singh

Sep 16, 2025, 5:55:15 AM
to sanskrit-p...@googlegroups.com
I don't understand what you mean when you say "this is why you cannot teach LLMs Sanskrit (or any language) directly."

As I understand it, LLMs are different from traditional 'algorithms' in that it is very easy to 'teach the rules' to LLMs. 'Teaching the rules' includes teaching Sanskrit or any other language. I also think that it is their 'distinguishing feature' that they get 'trained' from 'data sets' or from examples.

-- anunAda

Karthik

Sep 16, 2025, 6:31:20 AM
to sanskrit-programmers
This is the OP: "while chatting with Gemini about tokenization I was curious: what if that were not required for AI? It should be taught data in Sanskrit, since language is code itself; the machine doesn't have to convert it into numbers and can directly interpret the data we input in Sanskrit, and then it will be more sorted. Gemini suggested this group as welcoming."

What she is saying is:

- What if tokenization were not necessary to communicate with the "AI" (LLM)
- LLMs should be taught in "Sanskrit," "directly"
- Language is code
- Machines should not convert "language" to numbers
- Instead, they should directly take our data which we will provide in "Sanskrit"

When I say "this is why you cannot teach LLMs Sanskrit (or any language) directly," I am responding to these statements after providing a general idea about why that cannot happen.

---

> it is very easy to 'teach the rules' to LLMs.

It is not easy. I have tried this technique with locally hosted smaller models. They do not operate like Prolog, where you define a set of rules which they can apply to your problem to come up with a solution.

LLMs are large mimicry engines that predict the next token from their database of relations/weights/parameters based on probability (a setting that can be tuned). I read a very insightful comment recently from someone who said that LLMs hallucinate (invent stuff) all the time because they are designed to hallucinate; it is their primary function. It is only that we are okay with the results of their hallucinations most of the time.
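
That tunable setting is usually called temperature. A minimal sketch of the idea in Python, independent of any particular model's code:

import math, random

def sample(scores, temperature=1.0):
    # softmax over scaled scores; higher temperature flattens the distribution
    scaled = [s / temperature for s in scores]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]        # subtract max for numerical stability
    total = sum(exps)
    return random.choices(range(len(exps)), weights=[e / total for e in exps])[0]

print(sample([2.0, 1.0, 0.1], temperature=0.2))        # almost always picks index 0
print(sample([2.0, 1.0, 0.1], temperature=2.0))        # indices 1 and 2 show up often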

In a way, they are very similar to humans. They mimic patterns just like we do.

> it is their 'distinguishing feature' to get 'trained' from 'data sets' or from examples

Yes. But in huge quantities. If you give it one verse from Kalidasa, it won't be able to produce a similar verse. It has to be trained on all of Sanskrit literature and only then can it invent something that maybe looks like something Kalidasa might have written.

When I ask Gemini to:

"Write the story of the three little pigs using a conlang that you have invented. Also provide the grammar of the conlang and the vocabulary as used in the story. Do not make use of existing conlangs."

It produces something similar to Latin and Esperanto because that is what it has repeatedly seen.

Anunad Singh

Sep 16, 2025, 7:04:39 AM
to sanskrit-p...@googlegroups.com
That is interesting!

You have understood statements such as 'it should be taught data in Sanskrit', 'language is code itself', 'then it will be more sorted', etc.!!!

-- anunAda

vishal jaiswal

Sep 16, 2025, 10:36:01 AM
to sanskrit-p...@googlegroups.com
"LLMs hallucinate (invent stuff) all the time because they are designed to hallucinate; it is their primary function."

Isn't this because they are designed to be helpful at the cost of everything else?

Balaji R

Sep 16, 2025, 10:49:11 AM
to sanskrit-p...@googlegroups.com
Correct. I may not call it hallucination, because the matter or substance could be facts and real evidence too.

LLMs paint their imagination and make creative things with the matter of loaded facts and evidence.


हरिः व्योम

Sep 16, 2025, 2:19:49 PM
to sanskrit-p...@googlegroups.com
Why not just MD5?

Karthik

Sep 16, 2025, 11:59:16 PM
to sanskrit-programmers
> Isn't this because they are designed to be helpful at the cost of everything else?

This helpful-assistant persona that understands Q&A comes later in the pipeline. Initially, what is trained on a large dataset is a completion engine, so that the input "The man went to the" produces "The man went to the house/hospital/station/office/port/stadium/temple/etc." Later, it is fed millions of Q&A/chat examples using a specific format (each LLM has its own format, which you can see on HuggingFace in chat-template files). This is where all this:

<|im_start|>system<|im_sep|>You are a helpful assistant<|im_end|><|im_start|>user<|im_sep|>Is a tomato a fruit or a vegetable?<|im_end|><|im_start|>assistant<|im_sep|>

business comes from. As it has seen this pattern "<|im_start|>user<|im_sep|>QUESTION<|im_end|><|im_start|>assistant<|im_sep|>ANSWER<|im_end|>" a million times, when you send in a question, the completion comes in the form of a response. Later, a technique called reinforcement learning from human feedback (RLHF) is used to "steer" it in the right direction: encourage and discourage certain kinds of responses to the questions.
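
Assembling that syntax is plain string formatting. Here is a sketch in Python using the <|im_start|> format quoted above; real serving libraries apply each model's stored chat template, so this hand-rolled function is only for illustration:

def apply_chat_template(system: str, user: str) -> str:
    # wrap the messages in the special-token syntax the model was trained on
    return (f"<|im_start|>system<|im_sep|>{system}<|im_end|>"
            f"<|im_start|>user<|im_sep|>{user}<|im_end|>"
            f"<|im_start|>assistant<|im_sep|>")        # the model completes from here

print(apply_chat_template("You are a helpful assistant",
                          "Is a tomato a fruit or a vegetable?"))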

Whatever you do with it, in the end, it is just something that invents stuff on the basis of what it "knows." If its hallucination is useful to you, then you are happy and call it an answer or solution. If it isn't, then you blame it for hallucinating.

> Why not just MD5?


No need to waste 128 bits on every character/word/sub-word when you can do it in 15-20 bits.
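
To put numbers on that: an MD5 digest is always 128 bits, while an id in a roughly 200k-token vocabulary fits in about 18 bits. A quick check in Python (standard library only):

import hashlib, math

print(len(hashlib.md5("बालक".encode("utf-8")).digest()) * 8)  # 128 bits for any MD5 digest
print(math.ceil(math.log2(200_000)))                          # ~18 bits index a 200k-token vocabulary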