--
You received this message because you are subscribed to the Google Groups "sanskrit-programmers" group.
To unsubscribe from this group and stop receiving emails from it, send an email to sanskrit-program...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
Oh, Vishvas, really, you have discovered that I existed! I am deeply honored!

The short answer to "Could you publish your software on GitHub" is NO, as I explained to Mr Dhaval Patel at the last WSC in Bangkok, after I handed him a key with my system, ready to install, with all sources: "I am resolutely keeping out of hectic activities such as social networking, Skype, and cloud development systems. In particular, I have no intention to develop my software on the GitHub platform, and in any case I would need the agreement of my partners in this endeavor. I am of course interested in receiving comments and corrections from users and colleagues, but at a pace that is consistent with my feeble resources."

Let me add that I do not believe in crowdsourcing of software or linguistic data by non-professionals, however great their goodwill. Professional software demands careful design, competence in the application area, requirements analysis, tool coordination, coding discipline, testing, documenting, packaging, etc. "Love, deep sentiments, emotional attachment" are just not enough if one does not know basic computer science algorithms such as sorting, or has only a romantic idea of Sanskrit grammar. PHP veneer infatuation is just not sufficient. So pushing in half-baked code snippets written in random programming languages for surface processing of Unicode devanagari is simply a loss of time.

I had a look at https://sites.google.com/site/sanskritcode/home/plans and I am just not much impressed by what has been jointly accomplished in five years' time. Lots of ranting about "non-community based, closed source software and restricted data", that's for sure, but that's not a substitute for hard work, is it? A survey of software is indeed available. It appears, however, to be just a collection of stale links. Item "sandhi analysis", besides a broken link, points to an obsolete mirror site of my system copied at UoH five years ago. Is my original site so hard to locate with Google? Where is the crowd supposed to source this information page?

As the promoter of the OCaml and Coq efforts, I do not have lessons about free software to receive from anyone, especially from professionals from private companies that make a profit with proprietary software. I released my computational linguistics toolkit Zen in 2002, complete with all sources under LGPL and a full literate programming documentation as tutorial. I had a very hard time with the Xerox company, which considered this area as their monopoly; they tried hard, through thinly disguised referees on their payroll, to prevent me from publishing its concepts, let alone its source code. At the same time I had to press Wikipedia administrators into better control of their servers, in order to eradicate stolen copies of my lexicon hypertext data, repainted under a different CSS style sheet, and with careful removal of my author and copyright annotations, well hidden in some deep well in a Prague Wikipedia server by some copyleft crackpot. By chance, Google was peeking down the well, and I could spot it :-)

I first released my Sanskrit Engine set of Web services in August 2003, and advertised it publicly on the Indology list on September 8th, 2004. I released my morphology XML data banks under a free license at the same time, and many sites worldwide could start work on Sanskrit processing using these resources, as witnessed by the Heritage goldbook. I organized a scientific meeting in Paris on the topic in 2007, to which I invited many scholars and pandits to join a collaborative effort. It has had 5 occurrences worldwide since. The joint software effort, with my partners at University of Hyderabad and IIT Kharagpur, representing 40,000 lines of code in 120 modules of dense functional programming specifications, is now distributed on demand as a stand-alone set of Web services, under LGPL license. Its reader is available as a sub-service for analyzing the texts of the Sanskrit Library, and as a segmentation plug-in in Amba Kulkarni's Sanskrit parser. A full literate programming documentation of 763 pages is available for whoever wishes to learn and join this effort.

This is apparently not enough for free software ayatollahs, so now wannabe experts want to impose on us crowdsourcing development processes, which just do not make any sense in our context. How dare you send a letter to me, with a public cc to a newsgroup I do not subscribe to, as an attempt to bully me into obedience to your schemes? Don't you have any notion of common civility? Couldn't you have first discussed the matter with me privately, instead of sneaking in a semi-public cc to put pressure on me? Is this a proper way to introduce yourself to me? I have no recollection that we had earlier contact. Do you have any sense of professional ethics? What would you say if I wrote to the legal services of your employer, forwarding your email, emitted from an IP number recorded under Google's jurisdiction, to inquire whether it is recommended practice for Google engineers to bully independent developers into using the GitHub platform, so that "we can examine and discuss the source code and use portions of it within our tools more easily"? With a cc to Inria's intellectual property cell, to boot.

Sometimes it is useful to think before shooting one's mouth off. Please, Vishvas, come back to your senses.

GH
publicly available resources your efforts have linked on our unrepresentative, incomplete, badly out-of-date website.
Does anybody know of an open source implementation of a sandhi-splitting tool?
Please read these two papers by Oliver. They should be useful for this pursuit. Kindest wishes. Martin
This is used to train a recurrent neural network containing an input layer, hidden forward and backward layers, and an output layer. The size of the input layer is the number of distinct input phonemes. At time t, the input layer receives the phoneme observed at position t in a string. The hidden forward and backward layers capture the left and right context; LSTM cells are used in these layers. "The output layer receives the individual outputs from both hidden layers and performs a softmax regression for the desired target values. The network is trained with stochastic gradient descent."
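For those of us (like me) who find the architecture easier to read as code than as prose, here is a toy numpy sketch of the forward pass described above: one-hot phoneme inputs, a left-to-right and a right-to-left LSTM layer, and a softmax over their concatenated outputs at each position. The layer sizes, the two-tag "split here / no split" target, and all names are my illustrative assumptions, and random initialization stands in for the trained parameters; the paper's actual model is trained with stochastic gradient descent.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, W, U, b):
    """One LSTM step: x is the current input, (h, c) the previous state."""
    z = W @ x + U @ h + b                      # all four gates at once
    i, f, o, g = np.split(z, 4)
    i, f, o, g = sigmoid(i), sigmoid(f), sigmoid(o), np.tanh(g)
    c = f * c + i * g                          # new cell state
    return o * np.tanh(c), c

def bilstm_tag(seq, n_phonemes, hidden=8, n_tags=2):
    """Bidirectional LSTM over a phoneme-id sequence, with a per-position
    softmax (e.g. 'split here' vs 'no split'). Weights are random here;
    a real model would learn them by stochastic gradient descent."""
    def layer_weights():
        return (rng.normal(0, 0.1, (4 * hidden, n_phonemes)),
                rng.normal(0, 0.1, (4 * hidden, hidden)),
                np.zeros(4 * hidden))
    Wf, Uf, bf = layer_weights()               # forward (left-context) layer
    Wb, Ub, bb = layer_weights()               # backward (right-context) layer
    Wo = rng.normal(0, 0.1, (n_tags, 2 * hidden))  # output layer

    x = np.eye(n_phonemes)[seq]                # one-hot input layer
    T = len(seq)
    fwd, bwd = [None] * T, [None] * T
    h, c = np.zeros(hidden), np.zeros(hidden)
    for t in range(T):                         # left-to-right pass
        h, c = lstm_step(x[t], h, c, Wf, Uf, bf)
        fwd[t] = h
    h, c = np.zeros(hidden), np.zeros(hidden)
    for t in reversed(range(T)):               # right-to-left pass
        h, c = lstm_step(x[t], h, c, Wb, Ub, bb)
        bwd[t] = h
    # Output layer: softmax regression on the concatenated hidden states.
    logits = np.stack([Wo @ np.concatenate([fwd[t], bwd[t]]) for t in range(T)])
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

# One row of tag probabilities per input position.
probs = bilstm_tag([0, 3, 1, 2, 3], n_phonemes=4)
```

With untrained weights the probabilities are of course meaningless; the sketch is only meant to make the input/hidden/output wiring concrete.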
This gets to an overall accuracy of 93.24% with the full dataset; with 10,000 and 500,000 samples the accuracy is 77.92% and 91.03%, respectively. Note that all this uses only a known analysis of texts, no linguistic information or lexical or morphological resources (no language models, no dictionary or list of inflected forms).
Then Viterbi decoding to find the most probable analysis, using bigram probabilities learned from the annotated corpus. Gets to about 94.4% accuracy.
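To make the decoding step concrete, here is a minimal Viterbi sketch: given alternative segments at each position and bigram log-probabilities (as would be learned from the annotated corpus), it picks the highest-scoring path. The toy candidates, the bigram table, the `<s>` start symbol, and the fixed-slot formulation are all my simplifying assumptions; the paper presumably decodes over the full lattice of segmentations proposed by the network.

```python
def viterbi(candidates, bigram_logp, start="<s>", unseen=-10.0):
    """Most probable segment sequence under bigram log-probabilities.

    candidates : list of lists, the alternative segments at each slot
    bigram_logp: dict mapping (previous, segment) -> log probability
    unseen     : crude floor for unseen bigrams (an assumption; a real
                 decoder would use a smoothed language model)
    """
    # best maps a segment to (score of best path ending in it, that path)
    best = {start: (0.0, [])}
    for slot in candidates:
        nxt = {}
        for w in slot:
            # Extend every surviving path with w; keep only the best one.
            nxt[w] = max(
                (s + bigram_logp.get((p, w), unseen), path + [w])
                for p, (s, path) in best.items()
            )
        best = nxt
    return max(best.values())  # (log-probability, segmentation)

# Toy ambiguity: does "tasmaadapi" split as tasmaad+api or tasmaa+dapi?
table = {
    ("<s>", "tasmaad"): -1.0,
    ("<s>", "tasmaa"): -2.0,
    ("tasmaad", "api"): -0.5,
}
score, seg = viterbi([["tasmaad", "tasmaa"], ["api", "dapi"]], table)
# seg == ["tasmaad", "api"], score == -1.5
```

The point of the dynamic program is that only the best path into each segment survives at every slot, so the cost stays linear in the number of slots rather than exponential in the number of combinations.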
(+Arun, who has examined some similar work in the past, and may have some thoughts about it.)

I think there are a couple of directions in which our effort can be usefully expended, depending on our interests:

1. The work of actually performing automated analysis of sentences, splitting sandhi, etc.

2. Given such an analysis (whether created manually or automatically), tools and UIs to display the results of such annotation to a reader, in a form that is easily accessible (will run in the browser, looks good on mobile phones, can work offline if the data is available, adapts to the user's preferences, etc.).
Since I have zero experience with neural nets, these questions spring to mind: Suppose we wanted to train such a network. Would we need "cloud power", or would a single modern computer do? If the former, are people able to use something like Google's TensorFlow (https://en.wikipedia.org/wiki/TensorFlow) on some cloud provider's computers?
> This gets to an overall accuracy of 93.24% with the full dataset; with 10,000 and 500,000 samples the accuracy is 77.92% and 91.03% respectively. Note that all this uses only a known analysis of texts, no linguistic information or lexical or morphological resources (no language models, no dictionary or list of inflected forms).

So they're not using the entire 2.6 million strings, or even close. Is this corpus available online or by request?
> Then Viterbi decoding to find the most probable analysis, using bigram probabilities learned from the annotated corpus. Gets to about 94.4% accuracy.

Not that impressive a gain, as I'd anticipated :-)