Namaste. How do you sort Devanagari text? MS Excel sorting is going wrong. Is there a PHP or VBA script that can produce Sanskrit order as in Apte's dictionary?
--
You received this message because you are subscribed to the Google Groups "sanskrit-programmers" group.
To unsubscribe from this group and stop receiving emails from it, send an email to sanskrit-program...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.
Seems to work fine on google spreadsheets - https://docs.google.com/spreadsheet/ccc?key=0Al_QBT-hoqqVdFdOS2pDRm43LWc3LXZEY1N0NV9rWWc&usp=sharing
In the book (and how it should be):
So no, Google Docs are as miserable as MS Office. I wonder only why I'm the only one to notice that. See .pdf page 24, 25, 27, 29.
My expectation is exactly as that of Mārcis; I've been too lazy to do anything about it, but you *can* create a custom collation in Excel.
The long-term "fix" is to talk to Unicode about Sanskrit collations. For a start, you can ask on the Unicode Indic mailing list.
I cannot ask anything at https://groups.google.com/forum/#!forum/technical-hindi because it's a closed group. Can you, please?
I am not a member there either - I suppose you need to contact the owner, then join the group and then post.
I came to know about this discussion through Mārcis Gasūns's request to join 'वैज्ञानिक तथा तकनीकी हिन्दी समूह' (Scientific and Technical Hindi Group). I have quickly gone through the messages and would like to say the following:
(1) I have participated in some of the discussions on Devanagari collation. As far as I know, there is no standard for collation.
(2) The Unicode Consortium will not help in this regard. They are very clear about it on this page: http://www.unicode.org/faq/indic.html
> Q: What about collation of Indic language data? Is that just a binary sort?
> A: No. Collation order is not the same as code point order. A good treatment of some issues specific to collation in Indic languages can be found in the paper Issues in Indic Language Collation by Cathy Wissink.
> Collation in general must proceed at the level of language or language variant, not at the script or codepoint levels. See also UTS #10: Unicode Collation Algorithm. Some Indic-specific issues are also discussed in that report.
(3) OpenOffice and Microsoft Office have a default collation (the Unicode Collation Algorithm?) for Devanagari and other Indian scripts. They also have a provision to sort according to a specified order.
(4) I designed a Devanagari sorting program in JavaScript in 2007. In it, I followed a sorting order which I thought was appropriate.
(5) This program can be modified to follow a clearly specified sorting order.
So I request the pundits here to discuss and provide one, two, or three sorting orders for Devanagari first. Then I will try to modify my program accordingly.
I take that back. That would not order निशा and निश्चय correctly. But, essentially, the ordering is the same as Unicode with the following exception: the virāma appears before any vowel.
But then, looking at the Flickr images Mārcis sent, I just realized that he would want निशा to appear before निश्चय. So the previous algorithm should work.
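To make the two candidate orders concrete, here is a minimal Python sketch. It is only an illustration of the point under discussion, not anyone's actual code from this thread: `codepoint_key` and `virama_first_key` are hypothetical helper names, and the "virāma before any vowel" rule is modelled simply by giving the virāma a rank just after ह (U+0939), below every vowel sign.

```python
VIRAMA = "\u094d"  # DEVANAGARI SIGN VIRAMA

def codepoint_key(word):
    # Plain Unicode code point order.
    return [ord(ch) for ch in word]

def virama_first_key(word):
    # Same as code point order, except that the virama is ranked just after
    # the last consonant (U+0939) and therefore before every vowel sign.
    return [0x0939 + 0.5 if ch == VIRAMA else ord(ch) for ch in word]

words = ["निश्चय", "निशा"]
by_codepoint = sorted(words, key=codepoint_key)        # निशा first
by_virama_first = sorted(words, key=virama_first_key)  # निश्चय first
print(by_codepoint, by_virama_first)
```

In plain code point order the vowel sign ा (U+093E) precedes the virāma (U+094D), so निशा comes first; with the virāma moved ahead of the vowel signs the order flips. Which of the two matches a given dictionary has to be checked against the book.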
https://docs.google.com/document/d/1FUWZ7I6uezp-_HPQqtn_rcHsuQ0KMWKkJ2Nj0WW2EOQ/edit#bookmark=id.wp2yexm239p4 is what I want.
Seems like the algorithm I sent should work. Does it not?
On Monday, 23 September 2013 00:00:26 UTC+4, विश्वासो वासुकिजः wrote:
> Seems like the algorithm I sent should work. Does it not?

Sent? Where? My VBA script still gets
सनेमि sanemi
सन्त् sant
- सन्न sanna
Instead of
सनेमि sanemi
सन्त् sant
संतत /saṅtata/ (pp. of संतन्) 1) connected 2) continuous, constant
संतापन /saṅtāpana/ 1) tormenting 2) aching, hurting
संदर्प /saṅdarpa/ m. 1) fervor 2) mischief, prank 3) arrogance, haughtiness 4) stubbornness, willfulness
संध्यासमय /saṅdhyā-samaya/ m. see संध्याकाल
- सन्न /sanna/ pp. of सद्
I did not study your VBA code, but the following may be the problem:
In Microsoft Word I found that if you search for, say, क in the text कमल क्लेश किम कोमल, only the first क (in कमल) is found, not the others.
You are trying to do something similar, and it is not working as expected.
Not code, just an algorithm described in english.
Whoa - this is complicated.
On Monday, 23 September 2013 13:46:06 UTC+4, Anunad Singh wrote:
> I did not study your VBA code, but the following may be the problem:
> In Microsoft Word I found that if you search for, say, क in the text कमल क्लेश किम कोमल, only the first क (in कमल) is found, not the others.
This is as expected.
I am finding it difficult to comprehend the central point of the discussion. Can somebody restate the required sorting order in eight to ten lines?
On Tue, Sep 24, 2013 at 1:08 AM, Anunad Singh <anu...@gmail.com> wrote:
> I am finding it difficult to comprehend the central point of the discussion. Can somebody restate the required sorting order in eight to ten lines?
Anunad, please look here: https://docs.google.com/document/d/1imUVqdem21bTjbeXI300JxDntfYu1jgbWWV2N5Q3Qkc/edit (everyone with the link has edit rights, so Mārcis can modify it appropriately).
--
--
Vishvas /विश्वासः
http://mathworld.wolfram.com/LexicographicOrder.html
On Wednesday, 25 September 2013 13:15:02 UTC+4, Anunad Singh wrote:
Thank you, Vishvas.
But this is all I see on opening your file…
On Thu, Sep 26, 2013 at 9:41 PM, Anunad Singh <anu...@gmail.com> wrote:
--
--
Vishvas /विश्वासः
हरङ्गतः
हरचितिः
हरञ्चितः
हरञ्चितः
हरंगतः
हरंचितः
हरांशुः
Anusvara sorting still wrong
It’s now in file:
समेधन
समोकस्
सम्°
सम्पच्
Should be as in book:
समेधन
समोकस्
संपच्
If the above words are sorted as follows, is this acceptable?
Do you mean 'Is a computer program (or algorithm) for sorting Sanskrit words in retrograde order possible?'
If so, I have a program which can do that.
I realize I'm late to the party here, but I only just discovered this thread (via a link to Gasūns' blog posted on the technical-hindi group).
There have been several threads about Devanagari sorting on technical-hindi recently. The requirement that dead consonants be sorted before live consonants (e.g. हल् < हल) has been brought up, as well as the requirement that an anusvāra before a vārgika akṣara be treated as a pañcamākṣara. The requirement that a visarga before a sibilant be treated as the same sibilant is new to me, but isn't fundamentally different from the anusvāra requirement.
(While constructing `decomposition`, make sure to replace any anusvāras followed by a vārgika letter with the corresponding pañcamākṣara, and any visarga followed by a sibilant with the same sibilant.)
After constructing `decomposed_words`, sort it by comparing the `decomposition` field of each tuple in the array. (Nothing fancy here, just standard lexicographical comparison.) Call the sorted array `sorted_decomposed_words`.
Now, construct a new array `sorted_words`. Loop through the elements of `sorted_decomposed_words`, and for each element `(word_index, decomposition)` add `words[word_index]` to `sorted_words`.
And you're done. That's all you need for sorting classical Sanskrit.
(General-purpose Devanagari sorting is much more complicated.)
For retrograde sorting, there's only one change you need to make: when comparing two `decomposition`s, compare right-to-left instead of left-to-right.
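The steps above can be sketched in Python roughly as follows. This is a hedged illustration, not the implementation promised in this message: the function names `decompose` and `sort_sanskrit` are made up here, and the alphabet order (vowels, then anusvāra and visarga, then consonants) and the way the inherent अ is made explicit so that हल् sorts before हल are assumptions. The names `decomposed_words`, `sorted_decomposed_words` and `sorted_words` follow the description.

```python
# Independent vowels in assumed dictionary order.
VOWELS = "अआइईउऊऋ\u0960\u090c\u0961एऐओऔ"
# Dependent vowel signs, listed in the same order as VOWELS[1:].
VOWEL_SIGNS = ("\u093e\u093f\u0940\u0941\u0942\u0943\u0944"
               "\u0962\u0963\u0947\u0948\u094b\u094c")
SIGN_TO_VOWEL = dict(zip(VOWEL_SIGNS, VOWELS[1:]))
ANUSVARA, VISARGA, VIRAMA = "\u0902", "\u0903", "\u094d"
CONSONANTS = "कखगघङचछजझञटठडढणतथदधनपफबभमयरलवशषसह"
CONSONANT_SET = set(CONSONANTS)
VARGIKA = set(CONSONANTS[:25])   # the five vargas, five letters each
SIBILANTS = set("शषस")

# Assumption: vowels < anusvara < visarga < consonants.
RANK = {ch: i for i, ch in enumerate(VOWELS + ANUSVARA + VISARGA + CONSONANTS)}

def decompose(word):
    """Return the word as a list of alphabet ranks, one per sound."""
    # Characters outside the assumed alphabet are silently ignored.
    chars = [c for c in word if c in RANK or c in SIGN_TO_VOWEL or c == VIRAMA]
    units = []
    i = 0
    while i < len(chars):
        c = chars[i]
        nxt = chars[i + 1] if i + 1 < len(chars) else None
        if c in CONSONANT_SET:
            units.append(c)
            if nxt == VIRAMA:
                i += 1                        # dead consonant: no vowel
            elif nxt in SIGN_TO_VOWEL:
                units.append(SIGN_TO_VOWEL[nxt])
                i += 1
            else:
                units.append("अ")             # explicit inherent vowel
        elif c == ANUSVARA and nxt in VARGIKA:
            varga = CONSONANTS.index(nxt) // 5
            units.append(CONSONANTS[varga * 5 + 4])  # nasal of that varga
        elif c == VISARGA and nxt in SIBILANTS:
            units.append(nxt)                 # visarga becomes the sibilant
        elif c in SIGN_TO_VOWEL:
            units.append(SIGN_TO_VOWEL[c])    # stray vowel sign
        elif c != VIRAMA:
            units.append(c)                   # vowel, anusvara or visarga
        i += 1
    return [RANK[u] for u in units]

def sort_sanskrit(words, retrograde=False):
    decomposed_words = [(i, decompose(w)) for i, w in enumerate(words)]
    if retrograde:
        # Comparing right-to-left is the same as comparing reversed lists.
        decomposed_words = [(i, d[::-1]) for i, d in decomposed_words]
    sorted_decomposed_words = sorted(decomposed_words, key=lambda t: t[1])
    sorted_words = [words[i] for i, _ in sorted_decomposed_words]
    return sorted_words

print(sort_sanskrit(["सन्न", "संतत", "सन्त्", "सनेमि"]))
```

Under these assumptions the headwords quoted earlier in the thread (सनेमि, सन्त्, संतत, संतापन, संदर्प, संध्यासमय, सन्न) come out in the book's order, and `sort_sanskrit(words, retrograde=True)` gives the right-to-left comparison.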
I am planning to implement this; I should have a Javascript implementation up and running by the end of next week.
Sorry, I was under the impression that your system still had unresolved
issues.
I will be building a javascript implementation anyway, because, as I
said earlier, my aim is to sort Devanāgarī in general, not just Sanskrit.
I plan to build a VBA version (of the generalized algorithm) after the
Javascript version has been validated by the users of technical-hindi. I
will make sure to post it to this group when it's done
(As an aside: It is, of course, gratifying to know that Huet agrees with
my approach to Devanāgarī sorting.)
> Namaste. How do you sort Devanagari text? MS Excel sorting is going wrong. Is there a PHP or VBA script that can make Sanskrit order like in Apte's dictionary?
Why not sort simple Devanagari text using an English text sorter?
1. Copy the Devanagari text and convert it to ITRANS Roman.
2. Sort using the above link.
3. Convert the sorted text back to Devanagari.
I think there is a little Sanskrit sorting happening in this paper. I could not look into it and give the gist. My apologies.
Anyway, the Javascript version is done:
http://anubhav-chattoraj.github.io/indic-tools/devanagari_sorter/
For Sanskrit, the only options you'll need to change are:
"Treat anusvāra before vārgika akṣara as" : "pancamākṣara"
"Treat visarga before sibilant as" : "sibilant"
Do try it out, and let me know if you find any bugs.
That file is hard to navigate. It's hard to understand what part is relevant to what program, or which problems are still outstanding, which are resolved, and which are unsolvable.
क+ठ
क+ण
कर्ण-भु+षण
कार्षि(न्)
On 10/26/2013 11:00 AM, dhaval patel wrote:
Thanks for the pointers.
> Let me jot down a few.
1. Done. The page now has a text field to input the characters to ignore.
2. Is there an accepted order for sorting avagraha and Om? I'm sorting avagraha as अ and ॐ right before अ.
3. I don't follow. The order नि - निःक्षिप् - निःशिष् - निःसह can be obtained simply by sorting in Unicode order, without making any special adjustments for the visarga. (Even after adjustments, the order is the same: नि - निःक्षिप् - निश्शिष् - निस्सह.)
4. Already implemented.
5. Already implemented. Non-Devanagari characters as a whole are sorted after Devanagari characters, and their relative order is determined by the default string sort algorithm (whatever it is). Characters to be ignored can be entered in the "ignore" text field.
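Point 3 can be checked with a few lines of Python. This is only a sketch: `visarga_to_sibilant` is a hypothetical helper that models the "visarga before a sibilant becomes that sibilant" adjustment as a plain text substitution, and "Unicode order" is taken to mean Python's default code point comparison of strings.

```python
import re

VISARGA, VIRAMA = "\u0903", "\u094d"

def visarga_to_sibilant(word):
    # Replace a visarga followed by श, ष or स with that sibilant plus virama,
    # e.g. निःसह becomes निस्सह.
    return re.sub(VISARGA + "([शषस])",
                  lambda m: m.group(1) + VIRAMA + m.group(1),
                  word)

words = ["नि", "निःक्षिप्", "निःशिष्", "निःसह"]
adjusted = [visarga_to_sibilant(w) for w in words]

print(sorted(words) == words)        # already in code point order
print(adjusted)
print(sorted(adjusted) == adjusted)  # still in order after the adjustment
```

For these four words the adjustment does not change the relative order, because the visarga (U+0903) and the sibilants all sort after क at the third position in the same sequence either way.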
Is there a list of such transformations? I looked through a list of visarga-sandhis: http://learnsanskrit.org/references/sandhi/visarga
But it mentions nothing about visarga before ka = ṣ, or even visarga before ś, ṣ, s = ś, ṣ, s.
I must confess I love what I see. I sorted:
- हैमीकृ 8.Ā.
- होमय् 10.P.
- ह्रस् 1.Ā.
- ह्रासय् 10.P.
- ह्री 3.P.
- ह्लादय् 10.Ā.
- ह्लाद् 1.Ā.
- ह्वा 4.Ā.
Descending, and got:
- ह्वा 4.Ā.
- ह्लादय् 10.Ā.
- ह्लाद् 1.Ā.
- ह्री 3.P.
- ह्रासय् 10.P.
- ह्रस् 1.Ā.
- होमय् 10.P.
- हैमीकृ 8.Ā.
At best, I can add an option to ignore all non-Devanagari characters.
Done. The sorter now has a checkbox labelled "Ignore numbers and Roman
characters".
This should allow reverse sorting even when the line has a pada
identifier at the end.
(Link for convenience:
http://anubhav-chattoraj.github.io/indic-tools/devanagari_sorter/ )
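Why ignoring the trailing pada identifier matters for reverse sorting can be shown with a small Python sketch. How the sorter's checkbox works internally is not stated here, so this is only an assumed model: `devanagari_only` and `retrograde_key` are hypothetical helpers, and the retrograde comparison uses raw code points rather than the full decomposition.

```python
def devanagari_only(line):
    # Keep only characters from the Devanagari block (U+0900 to U+097F).
    return "".join(ch for ch in line if "\u0900" <= ch <= "\u097f")

def retrograde_key(line):
    # Compare from the end of the Devanagari part of the line.
    return devanagari_only(line)[::-1]

lines = ["ह्री 3.P.", "ह्वा 4.Ā."]
naive = sorted(lines, key=lambda s: s[::-1])   # compares ".P.3" with ".Ā.4"
filtered = sorted(lines, key=retrograde_key)   # compares the headwords only
print(naive)
print(filtered)
```

Without the filter, the comparison is decided by the Roman letters P and Ā at the end of each line; with it, the final vowel signs of the headwords decide the order.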
Fixed, the list now sorts in the correct retrograde+descending order:
क ख < क घ < कक < कग
and by "letter by letter", you mean
कक < क ख < कग < क घ
i.e. spaces are ignored when sorting "letter-by-letter".
If so, it sorts word-by-word by default. You can sort letter-by-letter
by adding a space to the ignore field.
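The difference between the two modes can be reproduced with plain code point sorting in Python. This is a sketch of the idea only, not the sorter's code; removing spaces from the key stands in for adding a space to the ignore field.

```python
words = ["कग", "क घ", "कक", "क ख"]

# Word-by-word: the space (U+0020) sorts before every Devanagari letter.
word_by_word = sorted(words)

# Letter-by-letter: spaces are ignored when comparing.
letter_by_letter = sorted(words, key=lambda w: w.replace(" ", ""))

print(word_by_word)      # क ख, क घ, कक, कग
print(letter_by_letter)  # कक, क ख, कग, क घ
```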
1) Above अ, which looks fishy: aṃ, āḥ, oḥ sort above "a" in reverse sorting. That does not look correct to me. https://www.dropbox.com/s/5db5z7u3bi2d52r/devanagarisorted.txt:
ऊंं
सयेफखांं
सारिस्थाखांं
भारिकं
पारिभाषिकं
महैरण्डं
सुतृष्णं
आनिधनं
अभिनिधनं
व्युत्क्रमं
भगनरायं
अधिहस्त्यं
यौथ्यं
प्रवरं
वृत्तिकारं
महद्बिलं
स्थूलनासं
किसं
ह्रस्वगवेधुकां
उच्चां
पुत्रीकरणमीमांसां
षडिडः
पृत्सुधः
व्यथ्ययः
शैलवालुकाः
गोपद्रुमलताः
अनूराधाः
आञ्जनाभ्यञ्जनाः
प्रहितोः
--
Doesn't look right to me either. ं and ः should sort between औ and क IMO.
My program sorts your list with भारिकं … प्रहितोः between होहौ and अक्.
I've seen Hindi sorting orders that place ं before अ, but this seems to have started out as an implementation mistake which some government bodies have since legitimized.
If you ever see one and scan a page, I’ll be thankful. Never thought it could exist on paper.
Not on paper, but see here (scroll down to page 57). TDIL, run by the Government of India, presents ँ < ं < अ as an “order pertinent to sorting by a computer program”.
Also, in Prabhat Prakashan’s “Brihat Hindi Shabdakosh”, the author complains about improper sorting in other dictionaries. Among other things, he accuses some (unnamed) dictionaries of using an अंकारादिक्रम.
I see the last change was made 12 days ago and there is still no VBA sorter. Oh, Anubhav, hurry up :) The Sanskrit community is looking at you, following your every step.
The Sanskrit community is gently reminded that I don’t do this full-time, informed that I have more pressing work to take care of at the moment, and strongly urged to be patient.
When I have any progress to report, I’ll make sure to report it here.
Mimer's method is mentioned on http://samskrtam.ru/sanskrit-sorting-devanagari/.
Here's some additional information..
> I plan to build a VBA version (of the generalized algorithm) after the
> Javascript version has been validated by the users of technical-hindi. I
> will make sure to post it to this group when it's done.
First of all, let me say that I am really sorry for the lack of updates.
I see that it has been eleven whole months since I last worked on this thing. That isn’t because I don’t want to work on it, but because I’ve honestly been too busy to.
What keeps a man busy day and night, weekdays and weekends, for eleven months? Being an actuary, which means working a full-time job and studying for exams on one’s off-time. Very financially rewarding, but doesn’t leave one with any free time.
Unfortunately, I don’t see the situation easing up any time soon. I expect to remain extremely busy at least until the next round of exams in May 2015. But that’s a very optimistic date…The best I can say is that I’ll probably be free enough to work on this after the round of exams after that one, in November 2015.
Of course, I understand that people really don’t want to wait another 13 months for updates, but I really can’t help it. All I can say is that pull requests are welcome; if anyone else wants to work on this, I’ll gladly clean up their code and merge it with mine. (That would take a long time… probably weeks for every pull request. But that’s still much shorter than 13 months.)
The silver lining (if you can call it that) is that once I get back to working on this, it won’t take all that long. All the changes to the Javascript app are just two or three weekends’ worth of work. (The existing app was built over a single weekend.) Porting to VBA would take another 2-3 months.
TL;DR: This is on the back burner for probably the next 13 months (*@#!!!), but it’ll only take a few months after that to add all the features everyone wants.
On 10/20/2013 11:48 PM, Mārcis Gasūns wrote:
> This is interesting. What do you mean by general? Add Hindi and Marathi
> or what else?
Yes, Hindi and Marathi, and other languages that are usually written in
Devanagari.
(I'd like to add support for languages that are only occasionally
written in Devanagari, such as Sindhi, but they tend not to have fixed
collation orders.)