The sanskrit_parser library has goals similar to what you describe (disclaimer: I am one of the contributors). It is still under development, but it can already produce reasonably good results. E.g., for the following input (corresponding to the first sentence you provided):
iti śrīsaṃkṣiptavedāntaśāstraprakriyāṃ śrīmatparamahaṃsaparivrājakācāryaśrīmacchaṅkarabhagavataḥ kṛtau bahimukhāntaḥ praṇavamajñānabodhinī adhyātmavidyopadeśavidhiḥ samāptaḥ
the output in SLP1 encoding is:
[iti, SrI, saMkziptavedAntaSAstraprakriyAm, SrImat, paramahaMsaparivrAjakAcArya, SrImat, SaNkara, Bagavatas, kftO, ba, him, uKAm, tas, praRavam, ajYAnaboDinI, aDyAtmavidyA, upadeSa, viDis, samAptas]
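For reference, this is roughly how the split can be produced from Python. This is a minimal sketch assuming the Parser/split() API shown in the project's README at the time of writing; exact class names, arguments and return types may differ in your version, so check the current documentation. The input string here is just a short example phrase, not your sentence:

```python
from sanskrit_parser import Parser

# Minimal sketch, assuming the Parser API from the sanskrit_parser README;
# names and options may differ between versions of the library.
text = "astyuttarasyAMdiSi"  # short example phrase in SLP1, not your input

parser = Parser(output_encoding="SLP1")
splits = parser.split(text, limit=10)  # ask for up to 10 candidate splits

if splits:
    for split in splits:
        print(split)
```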
For very long sentences, the splitting can take some time because of the graph path-finding algorithm; the above example took a couple of minutes on my laptop. Whether your web server can afford to spend that much time splitting long sentences is up to you. There are plans to add NN-based capabilities in the future, which may reduce the computational cost.
Another limitation is that the sanskrit_parser library currently does not handle any errors in the input. E.g., your original input had vidy*au*padesa, which had to be corrected to vidy*o*padeśa; bahimukha probably needs to be bahirmukha (which is why there is a weird split in the output), etc.
However, if you are just looking for text reuse detection, would it still work if the result is not semantically/morphologically valid? If so, you might be able to use a consistent sub-word tokenization algorithm such as sentencepiece (https://github.com/google/sentencepiece) or Morfessor (http://morpho.aalto.fi/projects/morpho/). For the former, a model trained on DCS data is available as part of this project: https://github.com/cvikasreddy/skt. You will still have to deal with errors/inconsistencies in the input, and probably some other issues that I haven't thought of :)
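For illustration, here is a minimal sentencepiece sketch in Python. The corpus file name, model prefix and vocab size are hypothetical placeholders, not anything taken from the skt project; if you use its DCS-trained model instead, you would skip the training step and just load that .model file (assuming it is a standard sentencepiece model):

```python
import sentencepiece as spm

# Train a subword model on a plain-text corpus, one sentence per line.
# "corpus.txt", the model prefix and vocab_size are hypothetical placeholders.
spm.SentencePieceTrainer.Train(
    "--input=corpus.txt --model_prefix=sanskrit_sp --vocab_size=8000"
)

# Load the trained model and tokenize a sentence into consistent sub-word units.
sp = spm.SentencePieceProcessor()
sp.Load("sanskrit_sp.model")
pieces = sp.EncodeAsPieces("iti śrīsaṃkṣiptavedāntaśāstraprakriyāṃ")
print(pieces)
```

The units you get this way are statistical rather than morphological, but they are applied consistently across the corpus, which is often good enough for text reuse detection.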