Konkani LLM: Multi-Script Instruction Tuning and Evaluation for a Low-Resource Indian Language (RC Fernandes, GS Patkar)


Frederick Noronha

Apr 5, 2026, 3:19:55 PM
to goa-rese...@googlegroups.com

Konkani LLM: Multi-Script Instruction Tuning and Evaluation for a Low-Resource Indian Language

Large Language Models (LLMs) consistently underperform in low-resource linguistic contexts such as Konkani. This performance deficit stems from acute training data scarcity compounded by high script diversity across Devanagari, Romi and Kannada orthographies. To address this gap, we introduce Konkani-Instruct-100k, a comprehensive synthetic instruction-tuning dataset generated through Gemini 3.
We establish rigorous baseline benchmarks by evaluating leading open-weights architectures including Llama 3.1, Qwen2.5 and Gemma 3 alongside proprietary closed-source models. Our primary contribution involves the development of Konkani LLM, a series of fine-tuned models optimized for regional nuances. Furthermore, we are developing the Multi-Script Konkani Benchmark to facilitate cross-script linguistic evaluation. In machine translation, Konkani LLM delivers consistent gains over the corresponding base models and is competitive with, and in several settings surpasses, proprietary baselines. https://arxiv.org/abs/2603.23529


_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/
_/  Frederick Noronha  फ्रेडरिक नोरोन्या  * فريدريك نورونيا‎
_/  AUDIO https://archive.org/details/@fredericknoronha
_/  http://goa1556.in +91-9822122436 784 Saligao Goa
_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/

William Robert Da Silva

Apr 10, 2026, 6:22:49 AM
to goa-rese...@googlegroups.com
Send some clarification on the entire project.
WRDS


John de Figueiredo

Apr 10, 2026, 8:33:09 AM
to goa-rese...@googlegroups.com, goa-rese...@googlegroups.com
William,
To understand what they are doing, at a minimum you have to know calculus, linear algebra, and a computer language.
John M. de Figueiredo 



Helga Do Rosario Gomes

Apr 10, 2026, 1:45:48 PM
to goa-rese...@googlegroups.com, goa-rese...@googlegroups.com
Hi William, 
I too would have liked to learn more about this, but here’s a simple explanation that I gleaned from my limited knowledge and ChatGPT.
As you are aware, the many AI large language models (LLMs) that underlie ChatGPT, Gemini, Claude, DeepSeek, etc. are trained by ‘supplying’ them with huge amounts of training data such as magazines, journals and who knows what! This is a highly contentious issue with creators, which is why the NYT has sued them, but that is another story.
But these LLMs struggle with languages like Konkani because there isn’t enough training data and the language is written in multiple scripts (like Devanagari, Romi, and Kannada), which makes learning (for these models) harder. To fix this, the authors created a large synthetic dataset (Konkani-Instruct-100k) and used it to fine-tune existing models such as Llama 3.1 (Meta), Qwen2.5 (Alibaba) and Gemma 3 (Google). They are also developing a benchmark to evaluate performance across scripts. As a result, their improved models perform better at tasks like translation, sometimes even outperforming proprietary baselines.
As João said, if you want to know how they do it then you need expertise in computer science and math, but as a highly regarded Konkani researcher I think you would be a perfect user to test their fine-tuned model.
Hope this helps.
Best, 
Helga 



William Robert Da Silva

Apr 12, 2026, 8:37:01 PM
to goa-rese...@googlegroups.com
Helga, I have been with oral Konkanni of different caste-occupational groups in Goa state and outside, less with Kerala and Navayat (I had contact with them in the 1970s). Written Konkanni moves away in the morpho-phonological, syntactical and verbal framework of most of these caste-occupational Konkanni. It is use-based in occupation and living and the vocabulary and linguistic issues are of another kind. Put together, they give a dynamic unity to Konkanni because the occupational groups are mutually communicative and highly sharing. My language 'bandavoll' and 'utram-daiz' are original to Konkanni. It does not celebrate 'Goy amchem mullpitt, Konkanni amchi bhas' because in history, Goa was a city, a capital city of different kingdoms ruling in coastal Konkann. (Others would call it Aparaant, in opposition to Puurvaant?). Konkanni people looked at it differently, beginning with the first arrivals from Africa some 40000 years ago, who moved down, eastward, north and into south-east Asia. The Mittgauddo, Kharvi, Gabit, Velip, etc. are good examples of language dynamism.
I have entertained and worked this out all these decades for Konkanni and other Indian languages.
Some thoughts, stray and away from LLMs.
W R Da Silva

Sangeeta Chakrabarty

Apr 13, 2026, 6:46:54 AM
to goa-rese...@googlegroups.com
Hello William Sir and Helga Ma'am,

It's wonderful to know about your interest in the Konkani LLM paper.

I can connect you to all the Computer Science Professors including Prof. Gaurang Patkar working on Konkani currently and I am sure they will also gain some more insights from your discussions. 

Happy to connect! 



Prof. Sangeeta Chakrabarty, 
HoD, Dept. of I.T,
S. S. Dempo College (Autonomous),
Integrated Educational Complex, 
Cujira, (opp GMC, Bambolim) Ilhas, 
Goa 403 202


William Robert Da Silva

Apr 14, 2026, 8:02:38 AM
to goa-rese...@googlegroups.com
Most of my work in Konkanni has been in the field with multiple caste-occupational groups, not with the written, Marathi-dominated Konkanni of Brahman communities, most of which in Goa are fish eaters and traders. I worked with Mamai Kamat and family for a long time on their history, with extension into Timmayya or Timoja Naik of the Vijayanagar naval guard in Karwar, etc. I still do not know what sort of Konkanni structure you perpetuate; it might have a continuity with the written, Marathi-influenced Konkanni of Bamonn Gaunvkari.
We could discuss these issues if you like and plunge in.
William Robert
