New version of the Digital Corpus of Sanskrit


Oliver Hellwig

Mar 16, 2012, 11:53:45 AM
to samskrita
Dear list,

a new version of the Digital Corpus of Sanskrit is now available at
http://kjc-fs-cluster.kjc.uni-heidelberg.de/dcs/index.php

Current version 1.6 contains, among other new features, a public
interface for discourse semantic annotation (FrameNet, co-annotators
are welcome), a (hopefully) nicer display of query results, and a
"Book search" function for scanned Sanskrit books.

Please refer to
http://kjc-fs-cluster.kjc.uni-heidelberg.de/dcs/index.php?contents=help_center
for an overview of added texts and features.

As always, feedback and ideas for improving interface and
functionality are highly appreciated.


Best regards,
Oliver Hellwig
----
Dr. Oliver Hellwig
South-Asia Institute, University of Heidelberg
Im Neuenheimer Feld 330
69120 Heidelberg, Germany

Vishvas Vasuki

Aug 19, 2012, 1:17:20 AM
to sams...@googlegroups.com, hell...@gmx.de
Dear Dr Hellwig,

I came across your wonderful website, and I had a request to make that would doubtless help many samskrita students and scholars. 

Could you please publish a table listing for each word (occurring, say in the rAmAyaNa or mahAbhArata) the frequency with which it appears, together with any other data you may have about the word? It would additionally be helpful if you were to repeat the exercise for all texts in your corpus (not just the great epics), just for comparison.

These tables would help samskrita students worldwide improve their knowledge of samskrita vocabulary - they could prioritize learning the most frequently occurring words in typical texts they may be interested in.

--
Humble regards and much anticipation,
Vishvas

श्रीमल्ललितालालितः

Aug 19, 2012, 1:41:42 AM
to sams...@googlegroups.com
I have a related question.
I have edited a few books in MS Word and Adobe InDesign. Is there any way to make a शब्दानुक्रमणिका (word index), श्लोकानुक्रमणिका (verse index), etc. automatically in these programs? It would help many.

murthy

Aug 19, 2012, 10:43:30 AM
to sams...@googlegroups.com
Would not a Devanagari sorter, such as the one that comes with Baraha, do the job? If you feed it the first lines of the slokas, it should arrange them alphabetically. For an index of all words, however, the words have to be fed in individually, which is a massive job.
Regards
Murthy
--
You received this message because you are subscribed to the Google Groups "samskrita" group.
To post to this group, send email to sams...@googlegroups.com.
To unsubscribe from this group, send email to samskrita+...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/samskrita?hl=en.

Shrivathsa B

Aug 19, 2012, 10:08:59 AM
to sams...@googlegroups.com
hariH OM,
svaamii,

   If you have the list in Unicode, making the anukramaNikA will be easy. If not, it will be very difficult.

svasti,
                JAYA BHAVAANII BHAARATII,
                                                                        shrivathsa.

2012/8/19 श्रीमल्ललितालालितः <lalitaa...@gmail.com>

श्रीमल्ललितालालितः

Aug 19, 2012, 11:41:46 AM
to sams...@googlegroups.com
Making the list is itself a difficult task. So, is there any script, etc., to do that automatically?

Hnbhat B.R.

Aug 19, 2012, 12:42:12 PM
to sams...@googlegroups.com
Microsoft Word has an indexing tool, which can index in English alphabetical order and probably also in Sanskrit alphabetical order if a Devanagari font is selected.

Only some editing may be needed after inserting the index. A table of contents is also possible if the entries are marked in the text; words or the beginnings of verses can be marked for indexing in the same way.
Field codes: Index field
{ INDEX [Switches ] }

Builds and inserts an index. The INDEX field collects index entries specified by XE (Index Entry) fields. The INDEX field is inserted by the Index and Tables command on the Reference submenu (Insert menu).

Switches

\b Bookmark 
Builds an index for the portion of the document marked by the specified bookmark. The field { INDEX \b Select } builds an index for the portion of the document marked by the bookmark "Select."

\c Columns 
Creates an index with more than one column on a page. The field { index \c 2 } creates a two-column index. You can specify up to four columns.

\d "Separators" 
Used with the \s switch, specifies the characters (up to five) that separate sequence numbers and page numbers. The field { INDEX \s chapter \d " : " } displays page numbers in the format "2:14." A hyphen (-) is used if you omit the \d switch. Enclose the characters in quotation marks.

\e "Separators" 
Specifies the characters (up to five) that separate an index entry and its page number. The { INDEX \e "; " } field displays a result such as "Inserting text; 3" in the index. A comma and space (, ) are used if you omit the \e switch. Enclose the characters in quotation marks.

\f "Identifier" 
Creates an index using only the specified entry type. The index generated by { INDEX \f "a" } includes only entries marked with XE fields such as { XE "Selecting Text" \f "a" }. The default entry type is "I".

\g "Separators" 
Specifies the characters (up to five) that separate a range of pages. Enclose the characters in quotation marks. The default is an en dash (–). The field { INDEX \g " to " } displays page ranges as "Finding text, 3 to 4".

\h "Heading" 
Inserts text formatted with the Index Heading style between alphabetic groups in the index. Enclose the text in quotation marks. The field { INDEX \h "—A—" } displays the appropriate letter before each alphabetic group in the index. To insert a blank line between groups, use empty quotation marks: \h "".

\k 
Defines the separators between cross references and other entries. 

\l "Separators" 
Specifies the characters that separate multiple-page references. The default characters are a comma and a space (, ). You can use up to five characters, which must be enclosed in quotation marks. The field { INDEX \l " or " } displays entries such as "Inserting text, 23 or 45 or 66" in the index.

\p "Range" 
Compiles an index for the specified letters. The field { INDEX \p a-m } generates an index for only the letters A through M. To include entries that begin with characters other than letters, use an exclamation point (!). The index generated by { INDEX \p !--t } includes any special characters, as well as the letters A through T.

\r 
Runs subentries into the same line as the main entry. Colons (:) separate main entries from subentries; semicolons (;) separate subentries. The field { INDEX \r } displays entries such as "Text: inserting 5, 9; selecting 2; deleting 15".

\s 
When followed by a sequence name, includes the sequence number with the page number. Use the \d switch to specify a separator character other than the default, which is a hyphen (-).

\y 
Enables the use of yomi text for index entries.  

\z 
Defines the language ID that Microsoft Word uses to generate the index. 

Examples

The field { INDEX \s chapter \d "." } builds an index for a master document. Each subdocument is a chapter; the chapter titles include a SEQ field that numbers the chapters. The \d switch separates the chapter number and page number with a period (.). An index generated from this field looks similar to the following:

Aristotle, 1.2
Atmosphere
     Earth, 2.6
     Jupiter, 2.7
     Mars, 2.6

The template can be customized. The above is the template for the Word index. The same can be used to index words, or the beginnings of verses, typed in Sanskrit fonts. This is the template from the Office XP version of Word. The same applies to the table of contents, whose entries should be marked as heading, subheading, etc., and customized to accommodate the Sanskrit fonts (Unicode will be better).

Just see the help in word document for the templates of both Contents Table and Indexing.









--
Dr. Hari Narayana Bhat B.R. M.A., Ph.D.,
Research Scholar,
Ecole française d'Extrême-Orient, Centre de Pondichéry
16 & 19, Rue Dumas
Pondichéry - 605 001


dhaval patel

Aug 19, 2012, 2:25:05 PM
to sams...@googlegroups.com
Respected Lalitalalita ji,
Please have a look at the site shown below.
If this satisfies your need, just let me know.
It breaks the whole text into words, using a space as the separator. Anything separated by a space is stored in a separate location.
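
The splitting described above can be sketched in a few lines of Python. This is only an illustration, not the code of the site mentioned: the function name `split_words` and the punctuation list are assumptions made for the example.

```python
# Characters stripped from the edges of each token (an assumed, incomplete list).
PUNCTUATION = "।॥,.;:!?()[]\"'"


def split_words(text):
    """Return the whitespace-separated words of text, in order."""
    # str.split() with no argument splits on any run of whitespace
    # (spaces, tabs, newlines) and never yields empty strings.
    tokens = (token.strip(PUNCTUATION) for token in text.split())
    # A token that consisted only of punctuation (a lone danda) is dropped.
    return [token for token in tokens if token]


sample = "धर्मक्षेत्रे कुरुक्षेत्रे समवेता युयुत्सवः ।"
words = split_words(sample)
print(len(words), words)
```

Splitting on spaces treats every sandhi-joined or compounded string as one "word", so the resulting list is a list of written forms rather than of dictionary words.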


श्रीमल्ललितालालितः

Aug 19, 2012, 3:12:45 PM
to sams...@googlegroups.com
On Sun, Aug 19, 2012 at 10:12 PM, Hnbhat B.R. <hnbh...@gmail.com> wrote:
Microsoft Word has an indexing tool, which can index in English alphabetical order and probably also in Sanskrit alphabetical order if a Devanagari font is selected.

Only some editing may be needed, after inserting the index.

Thanks, Bhat Ji. I saw this option, but have never used it. I'll have to read the help pages for it.
 
A table of contents is also possible if the entries are marked in the text; words or the beginnings of verses can be marked for indexing in the same way.

I've made a TOC by marking H1, H2, etc.
 

This almost went over my head. Is this the coding MS Word uses?
 
Just see the help in word document for the templates of both Contents Table and Indexing.

Sure. Thanks for pointing me in the right direction.



Although I know a little about HTML and CSS, I've no knowledge of PHP.
So I will have to make an effort to learn this.
Anyway, this page does separate the words on a web page. Suppose I copy and paste the typed document and get a list of words. Even then, there is no way to create a link between the list and the original document. I mean that there is no way to automatically add the page number and line number of each occurrence to the list of words generated by explode.
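
For plain-text input, line numbers at least can be recorded while the text is being split. The following is a minimal sketch under that assumption; `build_index` and the punctuation list are hypothetical names chosen for the example. Page numbers depend on the layout made in Word or InDesign and cannot be recovered from plain text this way.

```python
from collections import defaultdict

# Edge characters to strip, including danda marks and verse numbers (assumed list).
PUNCTUATION = "।॥,.;:!?()[]\"'0123456789०१२३४५६७८९"


def build_index(lines):
    """Map each word to the sorted 1-based line numbers on which it occurs."""
    index = defaultdict(set)
    for number, line in enumerate(lines, start=1):
        for token in line.split():
            word = token.strip(PUNCTUATION)
            if word:
                index[word].add(number)
    return {word: sorted(numbers) for word, numbers in index.items()}


text = """धर्मक्षेत्रे कुरुक्षेत्रे समवेता युयुत्सवः ।
मामकाः पाण्डवाश्चैव किमकुर्वत सञ्जय ॥
दृष्ट्वा तु पाण्डवानीकं व्यूढं दुर्योधनस्तदा ।"""

index = build_index(text.splitlines())
for word in sorted(index):
    print(word, index[word])
```

Each entry of the printed list carries the line numbers of its occurrences, which is the link between word list and source text that a bare split does not give.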

विश्वासो वासुकेयः (Vishvas Vasuki)

Aug 19, 2012, 5:11:53 PM
to Oliver Hellwig, sams...@googlegroups.com
Dear Dr Hellwig,

You got it! I did test the details pages for words in the corpus, but the following is exactly what I seek:

"Word" "Mbh Frequency" "Ra Frequency" .. "word type" "link to details"
"ca" 10000 9700 .. "avyaya" "..."
"tad" 9998 7899 .. "prAtipadikam" "..."
...

Extra annotations like "word type" could be anything you see fit, but the more details the better; readers can always ignore extra columns.
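
A table of this shape can be produced from tokenized texts with a few lines of Python. The sketch below is an assumption about how one might do it, not a description of how the DCS computes its figures; the function name `frequency_table` and the toy token lists are invented for the example, and the counts it prints are not corpus data.

```python
import csv
import io
from collections import Counter


def frequency_table(corpus):
    """corpus maps a text name to its list of word tokens.

    Returns (header, rows) with one row per word, sorted by total
    frequency (highest first) and then alphabetically.
    """
    names = list(corpus)
    counts = {name: Counter(tokens) for name, tokens in corpus.items()}
    vocabulary = set().union(*counts.values())
    rows = [[word] + [counts[name][word] for name in names] for word in vocabulary]
    rows.sort(key=lambda row: (-sum(row[1:]), row[0]))
    header = ["word"] + [name + " frequency" for name in names]
    return header, rows


# Toy input only; real token lists would come from the texts themselves.
corpus = {
    "Mbh": "ca tad ca dharma ca".split(),
    "Ra": "ca tad rAma".split(),
}
header, rows = frequency_table(corpus)

buffer = io.StringIO()
writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
writer.writerow(header)
writer.writerows(rows)
print(buffer.getvalue(), end="")
```

Writing tab-separated values keeps the output usable in any spreadsheet program; further columns such as word type would be appended to each row in the same way.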

I am very optimistic that such tables would help people seeking mastery of words - that they may be delighted by Sanskrit word-smithery and be great word-smiths themselves. They would be akin to modern improvements (in a manner of speaking) over compendia like amarakosha, which were used for that purpose.

--
With thanks,
Vishvas /विश्वासः


Nityanand Misra

Aug 19, 2012, 7:05:10 PM
to sams...@googlegroups.com, sams...@googlegroups.com
Can be easily done using sed, grep and/or awk in Linux. Or any scripting language in Windows. R can be used too. If you can share the source file I can help.

Sent from my iPhone

Nityanand Misra

Aug 19, 2012, 7:31:32 PM
to sams...@googlegroups.com, sams...@googlegroups.com
In case you are not aware, there are many online tools to convert between Unicode and other transliteration schemes. You can search on Google.

Sent from my iPhone

श्रीमल्ललितालालितः

Aug 19, 2012, 10:28:05 PM
to sams...@googlegroups.com
On Mon, Aug 20, 2012 at 4:35 AM, Nityanand Misra <nmi...@gmail.com> wrote:
Can be easily done using sed, grep and/or awk in Linux. Or any scripting language in Windows. R can be used too. If you can share the source file I can help.

I have a Linux setup on my other partition. Can you guide me on how to do it? A simple example with any small doc file would help me a lot. Also, please point me to resources on this.

Nityanand Misra

Aug 19, 2012, 11:10:59 PM
to श्रीमल्ललितालालितः, sams...@googlegroups.com
You can search the Linux man pages for sed, grep and awk. These are easy-to-use yet extremely powerful tools for, say, extracting the first few words of each verse. You can then use the sort command in Linux, or a spreadsheet, on the Unicode text to generate the alphabetical list. There are many resources on the Internet to help with these; a Google search would help.

I can give an example sometime in the future; it will be easier if you can share a few verses of your example source file. Plain text would do; in any case, the Linux tools work on text files and not doc files.
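
The same extract-and-sort idea can be sketched in Python instead of sed, grep and awk. The sketch assumes a plain-text file in which verses are separated by blank lines; `verse_index` is a hypothetical name. Sorting here is by Unicode code point, which follows the traditional Devanagari order closely but is not a full collation, so the result may need manual checking.

```python
def verse_index(text):
    """Return (first line, verse number) pairs sorted by first line."""
    # Verses are assumed to be separated by one blank line.
    verses = [verse.strip() for verse in text.split("\n\n") if verse.strip()]
    entries = [
        (verse.splitlines()[0].strip(), number)
        for number, verse in enumerate(verses, start=1)
    ]
    # Tuples sort by their first element, i.e. by the opening line.
    return sorted(entries)


text = """धर्मक्षेत्रे कुरुक्षेत्रे समवेता युयुत्सवः ।
मामकाः पाण्डवाश्चैव किमकुर्वत सञ्जय ॥

दृष्ट्वा तु पाण्डवानीकं व्यूढं दुर्योधनस्तदा ।
आचार्यमुपसङ्गम्य राजा वचनमब्रवीत् ॥"""

for first_line, number in verse_index(text):
    print(number, first_line)
```

To index only the first few words of each verse, the first line can be cut down with `" ".join(first_line.split()[:3])` before sorting.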

Sent from my iPhone

Anunad Singh

Aug 20, 2012, 7:35:24 AM
to sams...@googlegroups.com
The discussion above raises more than one problem (and the tools for each). In short, I have the following to say:

1) OpenOffice can be used to make a linked index:
http://www.tutorialsforopenoffice.org/tutorial/Introduction_To_Indexes.html

http://www.tutorialsforopenoffice.org/tutorial/Create_and_Modify_A_Table_of_Contents.html

http://doancia.blogspot.in/2010/12/how-to-create-linked-index-in.html


2) For creating a word frequency list and a concordance, the following free tool can be used. It supports Unicode (and hence Devanagari).

AntConc
http://www.antlab.sci.waseda.ac.jp/software.html

I myself have not done any indexing in OpenOffice, but I have used the AntConc concordancer program with Devanagari text.
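
For readers unfamiliar with the term, a concordance lists every occurrence of a word together with its neighbouring words. The sketch below shows the idea only; it is not how AntConc works internally, and the function name `concordance`, the `width` parameter and the toy sentence are assumptions made for the example.

```python
def concordance(tokens, keyword, width=2):
    """Return (left context, keyword, right context) for every occurrence."""
    hits = []
    for position, token in enumerate(tokens):
        if token == keyword:
            left = " ".join(tokens[max(0, position - width):position])
            right = " ".join(tokens[position + 1:position + 1 + width])
            hits.append((left, token, right))
    return hits


# Toy token list in transliteration; any list of words works the same way.
tokens = "rAmo rAjA ca lakSmaNaH ca sItA ca vanaM gatAH".split()
for left, word, right in concordance(tokens, "ca"):
    print(f"{left:>20} [{word}] {right}")
```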

-- Anunad

Subrahmanian R

Aug 20, 2012, 12:25:47 PM
to sams...@googlegroups.com
Respected Sirs,
 
I happen to have a list of the words occurring in the Ramayana and Mahabharata, which I downloaded from the net as part of a collection. It may be useful for the purpose mentioned, though it lists each word only once and so does not allow a frequency count.
 
Respectfully
R Subrahmanian

vocabulary.pdf

विश्वासो वासुकेयः / Vishvas Vasuki

Aug 20, 2012, 5:59:30 PM
to sams...@googlegroups.com, Oliver Hellwig
Dear Prof. Oliver,

On Sun, Aug 19, 2012 at 11:20 PM, Oliver Hellwig <oliver....@indsenz.com> wrote:

I will implement such a feature, most probably without a beautiful user interface, which usually takes most of the time to create, and post to the list when the data are ready.
As soon as you have such a table, you can, of course, process it with any of the tools our colleagues mentioned in this discussion. Personally, I prefer R.

Excellent! That is precisely what I seek too - raw data - in any of the popular formats like csv or tsv, which I can then cut, filter, sort as required, and use in a spreadsheet program of my choice.

While at it, may I request that samskrita content be printed in devanAgarI too (e.g., one could have two columns, the word in devanAgarI and the word in the Latin alphabet, rather than just the latter), because I am not familiar with convenient command-line Sanskrit script converters (though I am sure they exist).
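
A converter of this kind is small enough to write by hand. The sketch below assumes the input is in the Harvard-Kyoto scheme (the scheme that spellings such as rAmAyaNa follow); the thread does not say which scheme the DCS export uses, so that is an assumption, and `hk_to_devanagari` is a hypothetical name. Vedic accents, numerals and the long vocalic l are not handled.

```python
# token: (independent vowel letter, dependent vowel sign)
VOWELS = {
    "a": ("अ", ""), "A": ("आ", "ा"), "i": ("इ", "ि"), "I": ("ई", "ी"),
    "u": ("उ", "ु"), "U": ("ऊ", "ू"), "R": ("ऋ", "ृ"), "RR": ("ॠ", "ॄ"),
    "lR": ("ऌ", "ॢ"), "e": ("ए", "े"), "ai": ("ऐ", "ै"),
    "o": ("ओ", "ो"), "au": ("औ", "ौ"),
}
CONSONANTS = {
    "k": "क", "kh": "ख", "g": "ग", "gh": "घ", "G": "ङ",
    "c": "च", "ch": "छ", "j": "ज", "jh": "झ", "J": "ञ",
    "T": "ट", "Th": "ठ", "D": "ड", "Dh": "ढ", "N": "ण",
    "t": "त", "th": "थ", "d": "द", "dh": "ध", "n": "न",
    "p": "प", "ph": "फ", "b": "ब", "bh": "भ", "m": "म",
    "y": "य", "r": "र", "l": "ल", "v": "व",
    "z": "श", "S": "ष", "s": "स", "h": "ह",
}
MARKS = {"M": "ं", "H": "ः", "'": "ऽ"}  # anusvara, visarga, avagraha
VIRAMA = "्"


def hk_to_devanagari(text):
    """Convert Harvard-Kyoto transliteration to Devanagari."""
    out = []
    i = 0
    bare = False  # True while the last consonant has no vowel yet
    while i < len(text):
        # Greedy match: try two-character tokens ("kh", "ai") before one.
        for size in (2, 1):
            token = text[i:i + size]
            if token in CONSONANTS or token in VOWELS or token in MARKS:
                break
        # If nothing matched, token is now the single character text[i].
        if token in CONSONANTS:
            if bare:
                out.append(VIRAMA)  # previous consonant joins a conjunct
            out.append(CONSONANTS[token])
            bare = True
        elif token in VOWELS:
            independent, sign = VOWELS[token]
            out.append(sign if bare else independent)
            bare = False
        else:
            if bare:
                out.append(VIRAMA)  # word-final consonant
                bare = False
            out.append(MARKS.get(token, token))
        i += len(token)
    if bare:
        out.append(VIRAMA)
    return "".join(out)


for word in ["ca", "tad", "rAmAyaNa", "saMskRta"]:
    print(hk_to_devanagari(word), word, sep="\t")
```

The loop prints the two columns requested above, Devanagari first and the Latin-alphabet form second, separated by a tab.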

--
Much thanks,
Vishvas / विश्वासः


Vishvas Vasuki

Aug 22, 2012, 1:16:32 PM
to sams...@googlegroups.com
सूचनाभिः कृतविदस्मि अनुनाद! (I am grateful for the pointers, Anunad!)

विश्वासो वासुकेयः / Vishvas Vasuki

Sep 7, 2012, 8:02:26 PM
to sams...@googlegroups.com
Dear Oliver,

Thank you! It is a delight to peruse. Who knew that khara appears over 150 times in the रामायण! I have a few suggestions which might increase the delight of users. In roughly decreasing order of priority, they are:

1 It may be a good idea to show users the bigger of the following lists, rather than applying the current cutoff of 500:
** The list of all words appearing more than - say 10 - times in the text.
** The list of 500 most frequent words.

The rationale is that this would be a better filter for important words. Right now, the 500th most frequent word in the mahAbhArata is आप् (appearing over 260 times). I wager words appearing 50 times would be interesting too. (The rAmAyaNa and mahAbhArata lists loaded quite quickly for me; I would be happy to wait longer for the page to load.)

2 It would be nice to see the word list in devanAgarI, the way I can see the rest of the corpus website.

3 On clicking a word, the description page could open in a new tab. Currently, if one needs to go back to the word list from the description page, the word list has to be reloaded, which can be a slow process. Of course, users can open the page in a new window themselves.
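
The "bigger of the two lists" rule in suggestion 1 can be sketched as follows. The function name `important_words` and its parameters are assumptions for the example, and the counts are toy values rather than corpus figures.

```python
from collections import Counter


def important_words(counts, min_count=10, top_n=500):
    """Return the longer of: words seen more than min_count times, or the top_n words."""
    # Rank by frequency (highest first), breaking ties alphabetically.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    frequent = [(word, count) for word, count in ranked if count > min_count]
    top = ranked[:top_n]
    return frequent if len(frequent) > len(top) else top


# Toy counts for illustration only.
counts = Counter({"ca": 50, "tad": 12, "khara": 11, "Ap": 3})
print(important_words(counts, min_count=10, top_n=2))
print(important_words(counts, min_count=10, top_n=4))
```

Both candidate lists are prefixes of the same frequency ranking, so choosing the bigger one simply means cutting the ranking at the later of the two cutoffs.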


In any case, thanks again for this excellent feature!

--

विश्वासो वासुकेयः / Vishvas Vasuki

Sep 10, 2012, 7:38:20 PM
to Oliver Hellwig, sams...@googlegroups.com
Dear Prof Oliver,

On Mon, Sep 10, 2012 at 11:18 AM, Oliver Hellwig <oliver....@indsenz.com> wrote:
Point 3 may take a bit more time (any ideas how this can be done in a smart way?).

That is simple too, I think.
Instead of <a href="index.php?contents=einzelwort&amp;IDWord=51268">ca</a>, the program will need to produce:
<a href="index.php?contents=einzelwort&amp;IDWord=51268" target="_blank">ca</a>
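
For illustration, this is how such a link could be generated programmatically. The sketch is in Python and is not the site's own code; `word_link` is a hypothetical helper, and only the query parameters visible in the link above are used.

```python
from html import escape
from urllib.parse import urlencode


def word_link(word, word_id):
    """Build an anchor that opens the word's detail page in a new tab."""
    query = urlencode({"contents": "einzelwort", "IDWord": word_id})
    # escape() turns "&" into "&amp;", as HTML attribute values require.
    href = escape("index.php?" + query, quote=True)
    return f'<a href="{href}" target="_blank">{escape(word)}</a>'


print(word_link("ca", 51268))
```

The `target="_blank"` attribute is the only change from the original link: it makes the browser open the detail page in a new tab or window, so the word list stays loaded.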

विश्वासो वासुकेयः (Vishvas Vasuki)

Apr 10, 2013, 2:39:20 AM
to sams...@googlegroups.com, hell...@gmx.de
Dear Oliver,

On Sat, Aug 18, 2012 at 10:17 PM, Vishvas Vasuki <vishvas...@gmail.com> wrote:

As always, feedback and ideas for improving interface and
functionality are highly appreciated.

Recently I had occasion to use your server again, and it occurred to me that it would be very convenient if I could, for a given book and chapter, get a distinct URL which would take people to that exact page directly. Technically, I imagine it would be a simple change to accept GET requests (along with POST).

[One thing I like about reading texts from your server is that one can see the roots for each word. That would definitely help study groups such as mine grow in our mastery of the language.]
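
The idea is that the selection travels in the query string of the URL, so the address itself identifies the page. A minimal sketch follows; the parameter names `book` and `chapter` are hypothetical and are not the DCS's actual parameters, and only the base address is taken from this thread.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

BASE = "http://kjc-fs-cluster.kjc.uni-heidelberg.de/dcs/index.php"


def chapter_url(book_id, chapter_id):
    """Build a shareable URL; "book" and "chapter" are made-up parameter names."""
    return BASE + "?" + urlencode({"book": book_id, "chapter": chapter_id})


def read_selection(url):
    """What a server would do on a GET request: read the ids back from the URL."""
    params = parse_qs(urlsplit(url).query)
    return int(params["book"][0]), int(params["chapter"][0])


url = chapter_url(3, 12)
print(url)
print(read_selection(url))
```

With POST, the same two values are sent in the request body and never appear in the address bar, which is why such pages cannot be bookmarked or shared as links.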

--
--

विश्वासो वासुकिजः (Vishvas Vasuki)

Sep 22, 2013, 4:58:47 PM
to Oliver Hellwig, संस्कृतसन्देशश्रेणिः samskrta-yUthaH
+samskritam

Dear Oliver, 

To add another request: I would love a devanAgarI interface to the search page (so that I won't have to click around to find the right way to punch in the characters). 

If that's too much work, you could simply provide an interface which accepts any of the popular transliteration schemes (see wiki). I can then write a devanAgarI interface for it without any trouble (as I did for another German Sanskrit tool). 




On Thu, Apr 11, 2013 at 12:07 AM, Oliver Hellwig <hell...@gmx.de> wrote:
Dear Vishvas,

you are right, this is a nice and helpful idea. Will try to implement it, but I cannot promise anything right now, because we are quite busy here with the next OCR.
By the way: Have you ever considered adding new texts to the DCS database? I admit that SanskritTagger is not the most user friendly program in the universe :-), but I like the idea of continuously building the corpus from distributed sources. Just send me a mail in case you would like to participate.

Best, Oliver