Cooperation on a Tibetan text corpus and open source manuscript OCR development


Open Source Buddhism Library

Mar 14, 2021, 12:10:28 AM

to tibetan-initi...@googlegroups.com

Good day.

INTRODUCTION
The Open Source Buddhism Library initiative is a team of developers and practitioners dedicated to making Dharma texts publicly accessible for study and practice.
You can read about the initiative here:
https://www.buddism.ru///___DHARMA___/1548842022.phtml?edit=print

COOPERATION PROPOSAL

1. Based on five years of research, we have developed algorithms for optical character recognition (OCR) of Tibetan manuscripts.
For about seven years our library has supported a free OCR service for Tibetan and Sanskrit texts. The service was established with a Trace Foundation grant and donations from library users.
The core of this service is OCRLib, developed in C++: https://github.com/RimeOCRLIB/OCRLib
www.buddism.ru/ocr
For the task of Tibetan and Sanskrit manuscript OCR we are developing an open source OCR library. Its core is written in C, assembly, and C++.
The OCR core is a four-level convolutional network with segmentation at the functional network level. Each level of the network models a stage of the human physiological recognition process and is implemented as fast, optimized C and assembly functions. For handwritten text recognition, it uses topological graph analysis of the image skeleton structure.
https://www.buddism.ru///ocrlib/documentation/OCRLib_documentation2018_eng.pdf
This development has been supported by several official and private grants and donations, and we present these efforts as a collective work.
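
To illustrate what topological analysis of an image skeleton can look like, here is a minimal Python sketch. It is not taken from OCRLib and does not claim to reproduce its algorithm. It assumes a one-pixel-wide binary skeleton is already available and classifies each pixel by its number of 8-connected neighbours: one neighbour marks a stroke end, three or more mark a branch or crossing. The function name skeleton_nodes and the sample X_SHAPE are hypothetical.

```python
# Illustrative only: not OCRLib code. Classifies pixels of a one-pixel-wide
# skeleton by how many of their 8 neighbours are also skeleton pixels.

def skeleton_nodes(skeleton):
    """Return (endpoints, junctions) as lists of (row, col) tuples."""
    rows, cols = len(skeleton), len(skeleton[0])
    endpoints, junctions = [], []
    for r in range(rows):
        for c in range(cols):
            if not skeleton[r][c]:
                continue
            neighbours = 0
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    if dr == 0 and dc == 0:
                        continue  # skip the pixel itself
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < rows and 0 <= cc < cols and skeleton[rr][cc]:
                        neighbours += 1
            if neighbours == 1:
                endpoints.append((r, c))   # stroke end
            elif neighbours >= 3:
                junctions.append((r, c))   # branch or crossing
    return endpoints, junctions


# A small X-shaped skeleton: four strokes meeting at the centre.
X_SHAPE = [
    [1, 0, 0, 0, 1],
    [0, 1, 0, 1, 0],
    [0, 0, 1, 0, 0],
    [0, 1, 0, 1, 0],
    [1, 0, 0, 0, 1],
]

endpoints, junctions = skeleton_nodes(X_SHAPE)
print(endpoints)   # the four stroke ends
print(junctions)   # the single crossing point
```

The endpoints and junctions become the nodes of a graph whose edges are the strokes between them. With 8-connectivity, right-angle branches can produce small clusters of adjacent junction pixels, so a practical implementation would merge neighbouring junctions into a single node.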

2. Building on this development, we can now set the task of recognizing all of the library's materials and creating a single corpus of Buddhist texts, including printed books and manuscripts.
This corpus needs to include Tibetan, Sanskrit, Pali, Mongolian, Chinese, and Western-language Dharma texts, with support for search, grammatical analysis, and dictionary lookup.
At present the corpus includes 5 GB of Tibetan, Pali, and Western texts. An example of a text corpus record:
http://www.buddism.ru:4000/?index=5030&field=1&ln=eng&ocrData=TibetanUTFToEng&mode=read&ln=eng
Development of the text corpus includes a combined Tibetan-Sanskrit-Western dictionary edition and a translation memory.
At present the combined Tibetan dictionary includes 360,000 unique entries and the translation memory database includes 150,000 translation records.
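
As a rough illustration of how a translation memory lookup can work, here is a minimal Python sketch. It does not describe the library's actual database or API. The three sample records, the function name tm_lookup, and the 0.8 similarity cutoff are assumptions made for the example. It tries an exact match first and then falls back to the closest stored segment using the standard difflib module.

```python
import difflib

# Hypothetical in-memory translation memory: source segment -> translation.
TRANSLATION_MEMORY = {
    "སངས་རྒྱས": "Buddha",
    "ཆོས": "Dharma",
    "དགེ་འདུན": "Sangha",
}

def tm_lookup(segment, memory, cutoff=0.8):
    """Return (matched_source, translation) for the closest stored segment, or None."""
    if segment in memory:
        return segment, memory[segment]          # exact match
    close = difflib.get_close_matches(segment, list(memory), n=1, cutoff=cutoff)
    if close:
        return close[0], memory[close[0]]        # fuzzy match above the cutoff
    return None                                  # nothing similar enough


# A segment with a trailing tsheg still finds the stored record.
print(tm_lookup("སངས་རྒྱས་", TRANSLATION_MEMORY))
```

A database of 150,000 records would need an index rather than a linear scan, but the exact-then-fuzzy order shown here is a common design for translation memories.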

3. On the basis of the OCR and grammatical analysis development, we need to publish a free, publicly accessible comparative edition of the whole Tibetan Tripitaka Canon and translate three volumes into Russian and English.
At present we are trying to reach the 84000.co team with this proposal. We can present the work of leading Russian translators, based on the library's knowledge base of Tibetan texts.

Sarva Mangalam!
Open Source Buddhism Library
www.buddism.ru
Alexander Stroganov

letter_OSBL_Dict2020.pdf

CV_tibetan_OCR_team.txt

Dear Alexander.pdf

His_Holiness_Karmapa_open_letter_2019.pdf

Prof.Thurman_Rime_Support Letter 041713.pdf

YESHE_DE_OSBL_Letter.rtf
