New version (2.0) of Hindi Hunspell Dictionary

ShreeDevi Kumar

Mar 11, 2017, 8:26:06 AM

to technic...@googlegroups.com, Sanskrit Team_member, VishvAs VAsuki

I have updated the Hindi Hunspell dictionary, which can be used with Notepad++ as well as OpenOffice, LibreOffice, etc.

This time I have relied more on 'rules' to generate the word forms used for spell checking, rather than using word frequency lists as the base.

This has resulted in a smaller dictionary: 304 KB for the zipped version.

I would appreciate it if group members could give it a try and provide feedback.

The zip file is available from https://github.com/Shreeshrii/hindi-hunspell/blob/master/dict-hi_IN.zip

ShreeDevi Kumar

Mar 11, 2017, 8:27:42 AM

to technic...@googlegroups.com, Sanskrit Team_member, VishvAs VAsuki

For those who are interested, the rules are defined in

https://github.com/Shreeshrii/hindi-hunspell/blob/master/dict-hi_IN/hi_IN.aff

ShreeDevi
____________________________________________________________
Bhajan - Kirtan - Aarti @ http://bhajans.ramparivar.com

रवि-रतलामी

Mar 12, 2017, 4:57:55 AM

to Scientific and Technical Hindi (वैज्ञानिक तथा तकनीकी हिन्दी), sans...@cheerful.com, vishvas...@gmail.com

Great.

I tried to install this in Notepad++, but got stuck somewhere. Notepad++ runs the Aspell spell checker by default. I would be grateful if you could explain how to install the Hunspell dictionary and this Hindi dictionary (on Windows 10).

Regards,

रवि

ShreeDevi Kumar

Mar 12, 2017, 5:21:32 AM

to technic...@googlegroups.com, Sanskrit Team_member, VishvAs VAsuki

I use the DSpellCheck plugin with Notepad++ on Windows 10.

The dictionary files are kept in C:\Users\User\AppData\Roaming\Notepad++\plugins\config\Hunspell

If you are already using hi_IN.dic and hi_IN.aff, please rename the existing files first so that you can compare or go back, or keep both versions available under different names.
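The rename step can also be scripted. This is only a minimal sketch: the function name `backup_existing` and the `_old` naming scheme are illustrative assumptions, not something the plugin requires, and the folder is whatever path your own installation uses.

```python
from pathlib import Path


def backup_existing(hunspell_dir, names=("hi_IN.dic", "hi_IN.aff"), tag="old"):
    """Rename existing dictionary files, e.g. hi_IN.dic -> hi_IN_old.dic.

    Missing files are skipped. An existing backup is never overwritten.
    Returns the list of new paths.
    """
    renamed = []
    folder = Path(hunspell_dir)
    for name in names:
        src = folder / name
        if not src.exists():
            continue  # nothing to back up for this file
        dst = src.with_name(f"{src.stem}_{tag}{src.suffix}")
        if dst.exists():
            raise FileExistsError(f"backup already exists: {dst}")
        src.rename(dst)
        renamed.append(dst)
    return renamed
```

Call it with the Hunspell folder mentioned above before unzipping the new files there; pick a different `tag` if you want to keep more than one older version.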

ShreeDevi

ShreeDevi Kumar

Mar 12, 2017, 5:37:04 AM

to technic...@googlegroups.com, Sanskrit Team_member, VishvAs VAsuki

I will also look at the old discussion at https://groups.google.com/forum/#!topic/technical-hindi/giCi2_oJ_3c and check.

One question I have is about the nukta. Hariramji had said to use the combined (precomposed) letter, but the following seems to indicate the opposite.

As per Unicode and http://rishida.net/scripts/block/devanagari:

"Do not use the precomposed U+095D DEVANAGARI LETTER RHA ढ़, since normalization form NFC uses the decomposed sequence."
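The behaviour described in that quote can be checked with Python's standard `unicodedata` module. A small sketch; the helper name `codepoints` is just for illustration:

```python
import unicodedata


def codepoints(text):
    """Return the code points of a string as 'U+XXXX' labels."""
    return [f"U+{ord(ch):04X}" for ch in text]


precomposed = "\u095D"   # DEVANAGARI LETTER RHA
decomposed = "\u0922\u093C"  # DDHA followed by the nukta sign

# NFC does not recompose this pair, so both inputs end up decomposed.
print(codepoints(unicodedata.normalize("NFC", precomposed)))  # ['U+0922', 'U+093C']
print(codepoints(unicodedata.normalize("NFC", decomposed)))   # ['U+0922', 'U+093C']
```

So any text that passes through NFC normalization will contain the base letter plus nukta, whichever form was typed.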

ShreeDevi

रवि-रतलामी

Mar 13, 2017, 8:01:18 AM

to Scientific and Technical Hindi (वैज्ञानिक तथा तकनीकी हिन्दी), sans...@cheerful.com, vishvas...@gmail.com

Many thanks. This is really great. Fast and accurate.

If any member(s) of this group have a collection of words suitable for spell checking (in dictionary format, one word per line), please share it too, so that this can be made even better.

Regards,

रवि

रवि-रतलामी

Mar 13, 2017, 8:12:43 AM

to Scientific and Technical Hindi (वैज्ञानिक तथा तकनीकी हिन्दी)

There is one problem: the English full stop causes no trouble, but words followed by the Hindi full stop (purna viram) are being flagged as wrong. Please take a look.

Regards,

रवि

रवि-रतलामी

Mar 13, 2017, 8:29:52 AM

to Scientific and Technical Hindi (वैज्ञानिक तथा तकनीकी हिन्दी)

Defining the following in the .aff file solved the problem:

ICONV । .

Regards,

रवि

ShreeDevi Kumar

Mar 13, 2017, 8:43:11 AM

to technic...@googlegroups.com

Thanks. I have added it to the .aff file.

In Notepad++ it seemed OK, but I also have the settings as per

http://sanskritdocuments.org/hindi/hunspell.html

ShreeDevi

ShreeDevi Kumar

Mar 13, 2017, 8:46:32 AM

to technic...@googlegroups.com

Please see the files in https://github.com/Shreeshrii/hindi-hunspell/tree/master/Hindi

It would be helpful if you could review the files, e.g. hi-verbs.dic, to see whether any verb has the wrong rule applied or is missing any forms.

ShreeDevi

V S Rawat

Mar 13, 2017, 10:16:21 PM

to technic...@googlegroups.com

Then please add this line as well:
ICONV ॥ .
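To make the effect of these two entries concrete, here is a rough Python imitation of an input conversion table. It is only an illustration of the idea (replace the danda characters before the word is looked up), not how Hunspell itself is implemented; `ICONV_TABLE` and `apply_iconv` are made-up names. As far as I read the hunspell(5) manual, the real table in the .aff file is also preceded by a count line, which for two entries would be `ICONV 2`.

```python
# Illustrative stand-in for the two ICONV entries discussed above.
ICONV_TABLE = {
    "\u0964": ".",  # DEVANAGARI DANDA (purna viram)
    "\u0965": ".",  # DEVANAGARI DOUBLE DANDA
}


def apply_iconv(text, table=ICONV_TABLE):
    """Replace each input character with its mapped output before lookup."""
    return text.translate(str.maketrans(table))


# RA + AA matra + MA followed by a danda becomes the same word followed by "."
converted = apply_iconv("\u0930\u093E\u092E\u0964")
print(converted.endswith("."))  # True
```

With the danda mapped to a full stop, a word followed by it reaches the checker in the same shape as a word followed by an English full stop.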


हरिराम

Mar 15, 2017, 2:19:44 AM

to technic...@googlegroups.com

While working in MS Word, when we use

ड+़ instead of ड़

0921+093C instead of 095C

the built-in editor breaks the word at the nukta and treats it as two separate words, as the nukta is treated as being in the category of punctuation marks.

i.e. गुड़िया is treated as गुड+़ िया

which creates many problems when using translation tools and NLP.

When double-clicking to select a word, the nukta breaks the word in two.

The nukta also creates problems in sorting and indexing.

Therefore Unicode encodes separate characters with a built-in nukta:

क़ ख़ ग़ ज़ ड़ ढ़ फ़ etc.

0958 0959 095A 095B 095C 095D 095E etc.

So correct spelling should use these, instead of base character + nukta.

If necessary, the normalisation rules need to be changed.
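To see which of the two forms a given piece of text actually contains, the code points can be listed with Python's standard `unicodedata` module. A small sketch; the helper name `describe` is only illustrative. Note that Unicode's character database itself lists U+093C with general category Mn (a nonspacing mark); how an individual editor breaks words around it is a separate matter.

```python
import unicodedata


def describe(text):
    """List (code point, general category, character name) for each character."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.category(ch), unicodedata.name(ch, "?"))
        for ch in text
    ]


# Base letter followed by a separate nukta sign
for row in describe("\u0921\u093C"):
    print(row)

# Single character with the nukta built in
for row in describe("\u095C"):
    print(row)
```

Running this on a word copied from a document shows immediately whether it was typed as 0921+093C or as 095C.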


हरिराम
प्रगत भारत <http://hariraama.blogspot.in>

हरिराम

Mar 15, 2017, 2:29:53 AM

to technic...@googlegroups.com

Adobe InDesign CS6 can create documents using Indic Unicode text. The Adobe World-Ready Composer (WRC) provides correct word shaping for many non-Western scripts, such as Devanagari. The World-Ready Composers in the International English version of InDesign support several Indian languages, including Hindi, Marathi, Gujarati, Tamil, Punjabi, Bengali, Telugu, Oriya, Malayalam, and Kannada.

Hunspell spelling and hyphenation dictionaries are included in InDesign CS6, and so is the Adobe Devanagari font family.

So it is the need of the hour to try this spell checker in Adobe InDesign CS6 and Adobe InDesign CC7 as well.

If it works well, it will be a great help to the publishing industry.

ShreeDevi Kumar

Mar 15, 2017, 2:39:19 AM

to technic...@googlegroups.com

Regarding the nukta, if this is the desired way forward, I can change the dictionary and rules.

ShreeDevi

हरिराम

Mar 15, 2017, 2:50:43 AM

to technic...@googlegroups.com

It would be better if all base+nukta sequences were automatically replaced with the characters that have the nukta built in.
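Such an auto-replacement could look roughly like the following Python sketch. The mapping covers the eight precomposed letters U+0958 to U+095F; the names `PRECOMPOSED` and `compose_nukta` are illustrative, and the sketch assumes the nukta directly follows its base letter.

```python
NUKTA = "\u093C"  # DEVANAGARI SIGN NUKTA

# Base letter -> letter with built-in nukta
PRECOMPOSED = {
    "\u0915": "\u0958",  # KA   -> QA
    "\u0916": "\u0959",  # KHA  -> KHHA
    "\u0917": "\u095A",  # GA   -> GHHA
    "\u091C": "\u095B",  # JA   -> ZA
    "\u0921": "\u095C",  # DDA  -> DDDHA
    "\u0922": "\u095D",  # DDHA -> RHA
    "\u092B": "\u095E",  # PHA  -> FA
    "\u092F": "\u095F",  # YA   -> YYA
}


def compose_nukta(text):
    """Replace base+nukta pairs with the precomposed letter where one exists."""
    out = []
    for ch in text:
        if ch == NUKTA and out and out[-1] in PRECOMPOSED:
            out[-1] = PRECOMPOSED[out[-1]]  # merge the nukta into the previous letter
        else:
            out.append(ch)
    return "".join(out)
```

One caveat: any later step that applies NFC normalization will split these letters back into base+nukta, so the replacement has to be the last step before the text reaches the dictionary or the checker.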


V S Rawat

Mar 15, 2017, 6:10:19 AM

to technic...@googlegroups.com

I have not seen anything like this in MS Word.

On the contrary, MS Word treats all matras and conjuncts as a single unit and behaves accordingly.

Marks that combine with (attach to) another letter are treated by Word as a single unit.

If गु ड ़ have been joined in Word to write गुड़, you cannot separate ड and ़ with the cursor, nor does Word break them apart in any other way.

Rawat


हरिराम

Mar 15, 2017, 8:01:23 AM

to technic...@googlegroups.com

Rawat ji,

Which version of MS Word do you use?

If you use version 2007 or later, please type a word such as "गुड़हल" in Word (using ड+़),

then double-click "गुड़हल" to select it. Only "गुड़" or only "हल" will be selected. The whole word "गुड़हल" will not be selected at once with a single double-click.

Perhaps you do not use a CAT tool or an NLP program with MS Word. Please try it, and you can experience the problem yourself.

As per the screenshot below, if you type in Word with "use sequence checking" switched off in Word Options, editing becomes easier, and even bare matras appear when typed.

Regards.


हरिराम

Mar 15, 2017, 8:07:43 AM

to technic...@googlegroups.com

The latest standard booklet on the above subject, published in 2016 by the Central Hindi Directorate (केन्द्रीय हिन्दी निदेशालय), is available for download here:

http://hindinideshalaya.nic.in/hindi/schemeofpublication/FinalDevnagriLipi_05-07-2016.pdf

The spell checker's rules should be updated in accordance with it.

But does it supersede the BIS IS 16500:2012 standard?

Should this be regarded as the latest officially authorised standard at government level? Clarification on this is awaited.


V S Rawat

Mar 15, 2017, 9:45:59 AM

to technic...@googlegroups.com

I have used Word 2010 and 2016.

I copied the गुड़हल you wrote into Word 2016. On double-clicking, it selects the whole word at once.

I have never experienced what you describe, in all these years.

I have kept both "use sequence checking" and "Type and Replace" switched on.

Thanks,
Rawat


हरिराम

Mar 16, 2017, 1:10:08 AM

to technic...@googlegroups.com

Rawat ji,

You probably type Hindi via phonetic input or through Google Translit. This perhaps converts ड+़ (0921+093C) into ड़ (095C).

Please try with "use sequence checking" switched off.


V S Rawat

Mar 16, 2017, 1:59:39 AM

to technic...@googlegroups.com

1. Yes, I type with my shusha_uni keyboard layout, which types the Unicode character with the nukta built in rather than a separate nukta.

2. Google Translate, however, outputs the nukta separately and does not give the letter with the built-in nukta. This is a very big problem.

When I get output from Google Translate, I first have to replace each letter+nukta with the nukta letter; otherwise I get sorting problems in my files, and later problems with search and replace as well.

3. I did not understand this:

> Try with "use sequence checking" switched off.

When things work correctly with it on, why do you keep it off?

Thanks.
Rawat


ShreeDevi Kumar

Mar 16, 2017, 3:35:10 AM

to technic...@googlegroups.com

I have updated the files for the nukta and also added some more words.

Please test with the new version of the files from https://github.com/Shreeshrii/hindi-hunspell

https://github.com/Shreeshrii/hindi-hunspell/blob/master/dict-hi_IN.zip

https://github.com/Shreeshrii/hindi-hunspell/blob/master/hi_spellchecker_OOo3.oxt

https://github.com/Shreeshrii/hindi-hunspell/blob/master/Hindi/CHANGELOG

ShreeDevi

हरिराम

Mar 16, 2017, 6:46:01 AM

to technic...@googlegroups.com

I often have to prepare technical documents on Indian scripts.

For example, sometimes I need to type and display bare matras without any consonant.

Sometimes, to illustrate incorrect usage, I also have to show wrong uses of matras, conjuncts, syllables, and so on.

Sometimes I have to make global changes to only the matras or only the letters, using wildcards in find and replace.

None of this is possible when "use sequence checking" is on; Word treats the whole syllable as one unit.

So I keep this feature off.
