DCS data now available


Shreevatsa R

Dec 15, 2018, 9:32:21 PM
to sanskrit-programmers
Some here may already know, but FYI for others: Oliver Hellwig has made the data of the Digital Corpus of Sanskrit (http://kjc-sv013.kjc.uni-heidelberg.de/dcs/) available under a CC-BY license: see announcement here and details on GitHub here (direct link currently: here, but that may become obsolete).

I tried it: it's a 68 MiB .zip file containing a dcs.sql file that's 357 MiB decompressed. I was able to load it into a MySQL database (well, I had MariaDB but...) and play with it a bit, using "show tables" and "describe <tablename>" and so on. Everything seems to be there: the texts, the annotations, etc. Example queries you can try:
select * from text_lines where chapter_id=1067;
for the first sarga of the Rāmāyaṇa, 
select * from word_references where sentence_id in (108403, 108404);
for its last verse, etc.

This is a wonderful resource and I expect it's going to be invaluable to the community.

Shreevatsa R

Jun 25, 2019, 4:56:24 PM
to sanskrit-programmers
Just for the record, leaving here a set of instructions on a possible way to use this data, as it is somewhat cumbersome and not obvious for those of us less familiar with SQL/databases. The following works on macOS and will probably work elsewhere too:
  • Download dcs.zip (it's about 68 MiB)
  • `unzip dcs.zip` (results in `dcs.sql` which is 357 MiB)
  • Install MySQL or an equivalent, e.g. with `brew install mariadb` or `brew install mysql`
  • `mysql.server start` (apparently one needs to start a server and then connect with a client, even on a local machine)
  • `(echo "create database dcsdb; use dcsdb;"; cat dcs.sql) | mysql -u root` and wait a really long time (3+ minutes)
  • Now a local database has been created called dcsdb, and one can start using it with `mysql -u root dcsdb`
Example queries as mentioned in the previous email:
  • `select id, line, strophe, stanza from text_lines where chapter_id=1067;` gets the first sarga of the Ramayana
  • `select * from word_references where sentence_id in (108403, 108404);` gets the last verse of that sarga
Etc. Some way to navigate this more easily may be nice...

Hrishikesh Terdalkar

Jun 26, 2019, 5:08:46 AM
to sanskrit-p...@googlegroups.com
Some of us might not have UTF-8 as the default MySQL encoding, which might result in the IAST-encoded text not being stored properly. This happened to me, resulting in a lot of entries having '?' characters in them.

Thus, it might be better to ensure that the database is created with a Unicode character set.

This can be achieved with "create database dcsdb character set utf8mb4 collate utf8mb4_unicode_ci;"
So the entire import command would now be:

(echo "create database dcsdb character set utf8mb4 collate utf8mb4_unicode_ci; use dcsdb;"; cat dcs.sql) | mysql -u root

After this, the database is imported properly.
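
To illustrate why the '?' characters show up, here is a rough analogy in Python rather than MySQL itself. It assumes the non-UTF-8 default is a single-byte character set such as latin1 (an assumption; your server's default may differ), and the helper name `store_as` is made up for this sketch. Characters that the target character set cannot represent are replaced with '?', which is the same symptom as in the damaged import.

```python
def store_as(text, charset):
    """Mimic a lossy charset conversion: unencodable characters become '?'.

    This only imitates the symptom using Python's codecs; the real
    conversion happens inside the database server.
    """
    return text.encode(charset, errors="replace").decode(charset)


# "Rāmāyaṇa" in IAST, written with escapes so the code points are unambiguous:
# U+0101 is ā and U+1E47 is ṇ.
word = "R\u0101m\u0101ya\u1e47a"

print(store_as(word, "latin-1"))  # R?m?ya?a
print(store_as(word, "utf-8"))    # Rāmāyaṇa
```

The damage is not reversible once stored, since each '?' no longer records which character it replaced, so the fix has to happen at import time as in the command above.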
-
हृषीकेश



Shreevatsa R

Sep 1, 2020, 2:00:05 PM
to sanskrit-programmers
In case anyone is interested, using https://github.com/dumblob/mysql2sqlite I converted it to an SQLite file. It is larger, but not having to deal with all the server/client stuff is a relief.

For now I've put it here: https://drive.google.com/drive/folders/1cBVstjZy3kfhJuj1iHQ28B3L_n0gt_yJ (Feel free to take it and upload it somewhere more "proper" like GitHub or whatever).
It's a 246 MiB (258 MB) zip file that uncompresses to a 718 MiB (753 MB) file. (Don't know why it's 2x larger.)
I find SQLite easier to get running than MySQL, which I've often had trouble with (having a server running, connecting to it, etc.). As the database is just a single file, not only is it just a matter of running "sqlite3 dcs.sqlite3", but there are also many independent applications and libraries for many languages for reading SQLite. (Here's a podcast episode, with the creator of SQLite.)
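
As one example of the library route, here is a minimal sketch using Python's built-in `sqlite3` module. It assumes the converted file keeps the `text_lines` table with the `id`, `line` and `chapter_id` columns used in the queries earlier in this thread; I have not checked that against the converted file here. The helper names `chapter_lines` and `make_toy_db` are made up for this sketch, and the toy table holds placeholder strings, not real DCS data, so the code runs without the download.

```python
import sqlite3


def chapter_lines(conn, chapter_id):
    """Return (id, line) rows for one chapter, ordered by id."""
    cur = conn.execute(
        "SELECT id, line FROM text_lines WHERE chapter_id = ? ORDER BY id",
        (chapter_id,),
    )
    return cur.fetchall()


def make_toy_db():
    """Build a tiny in-memory stand-in with the assumed column names."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE text_lines "
        "(id INTEGER PRIMARY KEY, line TEXT, chapter_id INTEGER)"
    )
    conn.executemany(
        "INSERT INTO text_lines VALUES (?, ?, ?)",
        [
            (1, "placeholder line one", 1067),
            (2, "placeholder line two", 1067),
            (3, "placeholder line from another chapter", 2000),
        ],
    )
    return conn


if __name__ == "__main__":
    # For the real data, replace this with: sqlite3.connect("dcs.sqlite3")
    conn = make_toy_db()
    for row in chapter_lines(conn, 1067):
        print(row)
```

Python's `sqlite3` returns text columns as Unicode strings, so the encoding issue mentioned above for MySQL should not need any extra setup on this side.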







