Split big words/sentences


Amir Simantov

Feb 12, 2018, 1:14:32 PM2/12/18
to sanskrit-programmers
Hi guys,

I am aware of a post from about a year ago called "Sandhi splitting tool", but I was not sure whether my case is the same or whether I should revive such an old post.

Here is the issue. I am a webmaster (with no Sanskrit knowledge), and my client wants to compare versions of texts in the manuscripts stored in his database (this is called "text re-use detection"). However, all the tools that do this detection assume that words are separated by a space (or a comma, etc.).

So I need a library that I can run on demand on the website to split long sentences into their constituent Sanskrit words (I know that, morphologically speaking, there are many possible ways to split, so let's say I will take the most probable option). Only after splitting can I use one of the libraries that do the text re-use detection; a rough sketch of this two-step pipeline follows the examples below.

Here is an example of the texts to be compared:
  • iti śrīsaṃkṣiptavedāntaśāstraprakriyāṃ śrīmatparamahaṃsaparivrājakācāryaśrīmacchaṅkarabhagavataḥ kṛtau bahimukhāntaḥ praṇavamajñānabodhinī adhyātmavidyaupadesavidhi samāpta
  • iti śrīmat paramahaṃsaparivrājakācārya(śrī)machaṅ(ṃ)karakṛtavahimakhātaḥ pravaṇamajñānavodhanī atmavidyopadeśavidhi samāptaḥ
  • śrīmatparamahaṃsaparivrajakācāryaśrīmachaṃkarakṛtapraṇavatrayam ajñānabodhanaṃ adhyātmavidyopadeśavidhiḥ samāptaḥ
  • śrīmatparamahamsaparivrājakācāryaśrīmachakarācāryakṛtabahirmukhāntaḥ pravaṇamajñānabodhinī // adhayātmavidyopadeśavidhiḥ samāptaḥ
(Taken from a work called Ajñānabodhinī).
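
To make the intended pipeline concrete, here is a rough sketch in Python: split first, then feed the token lists to any off-the-shelf similarity routine. split_words is a hypothetical placeholder for whatever splitter I end up using, and difflib from the standard library only stands in for a real text re-use detector.

from difflib import SequenceMatcher

def split_words(sentence):
    # Hypothetical sandhi/compound splitter -- to be replaced by a real library.
    raise NotImplementedError

def reuse_score(text_a, text_b):
    # Compare the two manuscript versions word by word rather than character by character.
    tokens_a = split_words(text_a)
    tokens_b = split_words(text_b)
    return SequenceMatcher(None, tokens_a, tokens_b).ratio()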

Does anyone know about something that I can use?

Thank you very much.
Amir

Michael Bykov

Feb 12, 2018, 1:34:15 PM2/12/18
to sanskrit-programmers
Hi Amir,

That is a complex task. In my experiments, now deprecated (see https://github.com/mbykov/vigraha), a 50-symbol word gave up to 150,000 possible chains of candidate padas, from which you then have to select the chains consisting of real words.
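
The blow-up itself is easy to reproduce with a toy recursion: every position where a known word ends opens another branch, so the number of candidate chains grows roughly exponentially with the word length. A minimal sketch, with a toy lexicon and no sandhi rules, purely for illustration:

# Enumerate every way to cover a string with known words
# (a real splitter also has to apply sandhi rules at each boundary).
LEXICON = {"na", "nara", "ra", "rama", "ma"}

def candidate_chains(s):
    if not s:
        return [[]]
    chains = []
    for i in range(1, len(s) + 1):
        head = s[:i]
        if head in LEXICON:
            for rest in candidate_chains(s[i:]):
                chains.append([head] + rest)
    return chains

print(candidate_chains("narama"))
# -> [['na', 'ra', 'ma'], ['na', 'rama'], ['nara', 'ma']]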

Nevertheless, you can try a Chromium plugin based on this routine:


The plugin can process words of up to about 25 symbols reasonably quickly; longer words have to be split into parts by hand before processing.




 




Dhaval Patel

Feb 12, 2018, 6:24:44 PM2/12/18
to sanskrit-p...@googlegroups.com

Avinash L Varna

Feb 12, 2018, 7:48:11 PM2/12/18
to sanskrit-programmers
The sanskrit_parser library indeed has goals similar to your request (disclaimer: I am one of the contributors). It is still under development, but it can produce reasonably good results. For example, for the following input (corresponding to the first sentence you provided):
iti śrīsaṃkṣiptavedāntaśāstraprakriyāṃ śrīmatparamahaṃsaparivrājakācāryaśrīmacchaṅkarabhagavataḥ kṛtau bahimukhāntaḥ praṇavamajñānabodhinī adhyātmavidyopadeśavidhiḥ samāptaḥ

the output in SLP1 encoding is:
[iti, SrI, saMkziptavedAntaSAstraprakriyAm, SrImat, paramahaMsaparivrAjakAcArya, SrImat, SaNkara, Bagavatas, kftO, ba, him, uKAm, tas, praRavam, ajYAnaboDinI, aDyAtmavidyA, upadeSa, viDis, samAptas]

For very long sentences, the splitting can take some time because of the graph path-finding algorithm; the above example took a couple of minutes on my laptop. Whether your web server can afford to spend that much time splitting long sentences is up to you. There are plans to add NN-based capabilities in the future that may reduce the computational cost.
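
For reference, invoking the library looks roughly like the sketch below. This follows the Parser interface in the sanskrit_parser README, so take the exact class and argument names as assumptions; they may differ between versions.

# Rough usage sketch; API details are assumptions based on the project README.
from sanskrit_parser import Parser

text = "iti śrīsaṃkṣiptavedāntaśāstraprakriyāṃ ..."  # sentence to split

parser = Parser(output_encoding='SLP1')
# limit caps how many alternative splits are computed; 1 keeps only the best-scoring split
for split in parser.split(text, limit=1):
    print(split)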

Another limitation is that the sanskrit_parser library currently does not handle any errors in the input. For example, your original input had vidy*au*padesa, which had to be corrected to vidy*o*padeśa; bahimukha probably needs to be bahirmukha (which is why there is a weird split in the output); etc.

However, if you are just looking for text re-use detection, would it still work if the result is not semantically/morphologically valid? If so, you might be able to use a consistent sub-word tokenization algorithm such as SentencePiece (https://github.com/google/sentencepiece) or Morfessor (http://morpho.aalto.fi/projects/morpho/). For the former, a model trained on DCS data is available as part of this project: https://github.com/cvikasreddy/skt
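
For the subword route, a sketch with the SentencePiece Python bindings (the model file name here is a placeholder; a model trained on DCS data, e.g. from the skt repository above, would be loaded the same way):

# Consistent subword tokenization with SentencePiece.
import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.Load("skt.model")  # placeholder path to a trained model

pieces = sp.EncodeAsPieces("śrīmatparamahaṃsaparivrājakācāryaśrīmacchaṅkarabhagavataḥ kṛtau")
print(pieces)  # subword pieces; consistent, but not guaranteed to be morphologically valid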

You will still have to deal with errors/inconsistencies in the input, and probably some other issues that I haven't thought of :)

Thanks
Avinash




Amir Simantov

Feb 15, 2018, 7:29:29 AM2/15/18
to sanskrit-p...@googlegroups.com
Hi!
Thank you very much, all.
I will check all the references.
You are very kind.

Amir

