An approximate-deduplicating-key for sanskrit texts: problem motivation, definition, shared tests

विश्वासो वासुकिजः (Vishvas Vasuki)

Nov 8, 2017, 9:58:07 PM
to sanskrit-programmers, dhaval patel
+dhaval, who I know has encountered a similar problem with his dict work.

The same Sanskrit text can be found written in varying forms (for reasons listed later in this email). So one runs into problems such as how to return texts matching "dharma" when the search query is the equally valid "dharmma" (which some people DO prefer to use).

One solution is to map both the query string and the corpus word to the same nearly-unique hash. (Alternate solutions exist, but they're tougher.) Following this approach, one arrives at the general problem of designing a function, described in the following doc:

* Given some devanAgarI sanskrit text, this function produces a "key" so that
* 1] The key should be the same for different observed orthographical forms of the same text. For example:
* - "dharmma" vs "dharma"
* - "rAmaM gacChati" vs "rAma~N gacChati"
* - "kurvan eva" vs "kurvanneva"
* 2] The key should be different for different texts.
* - "stamba" vs "stambha"
*
* This function attempts to succeed at [1] and [2] *almost* all the time.
* The longer the text, the lower the probability of failing at [2], while the probability of failing at [1] increases (albeit very slightly).
*
* Sources of orthographically divergent forms:
* - Phonetically sensible grammar rules
* - Neglect of sandhi while writing
* - Punctuation, spaces, avagraha-s.
* - Regional-language-influenced mistakes (La instead of la.)
*
* Some example applications of this function:
* - Create a database of quotes or words with minimal duplication.
* - Search a database of quotes or words while being robust to optional forms.
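
To make the contract above concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions, not the implementation linked in this post: it works on the ITRANS-like romanization used in the examples rather than on devanAgarI, the names `approximate_key` and `build_index` are hypothetical, and the three rules (strip spacing and punctuation, collapse doubled consonants, unify a nasal before a consonant) cover only the listed examples.

```python
import re

# Assumption: letters that can start a consonant in the ITRANS-like
# romanization used in the examples (not an exhaustive list).
_CONSONANT = "kgcCjTDtdpbyrlLvsSzh"


def approximate_key(text: str) -> str:
    """Return a rough deduplication key for romanized Sanskrit text."""
    # 1. Drop whitespace, punctuation and avagraha-like marks.
    key = re.sub(r"[\s.,;:|'\-]+", "", text)
    # 2. Collapse a doubled consonant ("dharmma" -> "dharma").
    #    The (?!h) guard leaves clusters such as "ddh" alone, so that
    #    "buddha" and "budha" do not collide.
    key = re.sub(r"([kgcjTDtdpbyrlLvszSnmN])\1+(?!h)", r"\1", key)
    # 3. Write any nasal that directly precedes a consonant as "M",
    #    so that "rAmaM gacChati" and "rAma~N gacChati" agree.
    key = re.sub(rf"(?:~N|~n|[NnmM])(?=[{_CONSONANT}])", "M", key)
    return key


def build_index(quotes):
    """Group quotes by key so variant spellings land in one bucket."""
    index = {}
    for quote in quotes:
        index.setdefault(approximate_key(quote), []).append(quote)
    return index


index = build_index(["dharma", "rAmaM gacChati", "stambha"])
print(index.get(approximate_key("dharmma")))  # ['dharma']
```

The order of the rules matters: doubled consonants are collapsed before nasals are unified, otherwise "kurvanneva" and "kurvan eva" would end up with different keys. A real implementation would need many more rules and would start from devanAgarI input.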

I've implemented an initial solution in the Scala language here.


Implementations in any language can use this shared test set:


--
Vishvas /विश्वासः
