Language codes to use for Sanskrit transliteration


Shree Devi Kumar

Apr 9, 2016, 12:26:25 PM
to sanskrit-programmers
What are the recommended language codes to be used for Sanskrit transliteration (ITRANS, IAST) in webpages?

I have used en-IN for ITRANS and sa-Latn for IAST/ISO romanization, but they are marked with errors/warnings - http://hreflang.ninja/check/?url=http%3A%2F%2Fsanskritdocuments.org%2Fdoc_deities_misc%2FrAdhAparihAra.html%3Flang%3Dsa-Latn

Here is a sample of how it is defined in the webpage - view-source:http://sanskritdocuments.org/doc_z_misc_major_works/meghanew.html?lang=en-IN


<meta name="keywords" content="॥ मेघदूत (कालिदास) ॥, .. meghadUta (kAlidAsa) .., , philosophy \ hinduism \ religion , kalidasa , pramukha , doc_z_misc_major_works, major_works, , meghanew, Sanskrit, UTF-8, Unicode, Devanagari, ITRANS, IAST, Roman, Sanskrit Transliteration,संस्कृत ">
<meta name="description" content="॥ मेघदूत (कालिदास) ॥, , .. meghadUta (kAlidAsa) .., Sanskrit text in Unicode Devanagari, other Indian languages, ITRANS and IAST (Roman) encoding as pdf and webpage">
<base href="http://sanskritdocuments.org/">
<link rel="alternate" href="./doc_z_misc_major_works/meghanew.html?lang=sa" hreflang="x-default" >
<link rel="alternate" href="./doc_z_misc_major_works/meghanew.html?lang=sa" hreflang="sa" >
<link rel="alternate" href="./doc_z_misc_major_works/meghanew.html?lang=sa-Latn" hreflang="sa-Latn" >
<link rel="alternate" href="./doc_z_misc_major_works/meghanew.html?lang=en-IN" hreflang="en-IN" >
<link rel="alternate" href="./doc_z_misc_major_works/meghanew.html?lang=ta" hreflang="ta" >
<link rel="alternate" href="./doc_z_misc_major_works/meghanew.html?lang=kn" hreflang="kn" >
<link rel="alternate" href="./doc_z_misc_major_works/meghanew.html?lang=te" hreflang="te" >
<link rel="alternate" href="./doc_z_misc_major_works/meghanew.html?lang=ml" hreflang="ml" >
<link rel="alternate" href="./doc_z_misc_major_works/meghanew.html?lang=or" hreflang="or" >
<link rel="alternate" href="./doc_z_misc_major_works/meghanew.html?lang=gu" hreflang="gu" >
<link rel="alternate" href="./doc_z_misc_major_works/meghanew.html?lang=bn" hreflang="bn" >
<link rel="alternate" href="./doc_z_misc_major_works/meghanew.html?lang=pa" hreflang="pa" >
<link rel="alternate" href="./doc_z_misc_major_works/meghanew.html?lang=hi" hreflang="hi" >

Any suggestions ...

विश्वासो वासुकिजः (Vishvas Vasuki)

Apr 9, 2016, 1:28:14 PM
to sanskrit-programmers
Just guessing: would sa-US work?

--
You received this message because you are subscribed to the Google Groups "sanskrit-programmers" group.
To unsubscribe from this group and stop receiving emails from it, send an email to sanskrit-program...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.



--
--
Vishvas /विश्वासः

Shreevatsa R

Apr 9, 2016, 2:06:34 PM
to sanskrit-programmers
Both en-IN and sa-Latn are valid, as you can check from https://r12a.github.io/apps/subtags/ -- it is probably just a bug in hreflang.ninja that it doesn't recognize them. (Unfortunately, in this it may be reflecting browser bugs too: poor support for certain tags.)

Note that "en-IN" means "English as spoken in the India region", and you should use it for Indian-English text, not for ITRANS Sanskrit text.
Whether you're using ITRANS or IAST/ISO-15919 romanization, it is still Sanskrit written in the Latin script, so you should use "sa-Latn" for both.

If you want to indicate the specific script convention used, I think you can in principle use the "private use" section of language tags, e.g. sa-Latn-x-IAST and sa-Latn-x-ITRANS. But there's no standard for this; you'll be establishing your own convention.
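To make the private-use idea concrete, here is a minimal sketch in Python. It only handles the shapes discussed in this thread (language, optional four-letter script, optional private-use section introduced by "x"); it deliberately ignores regions, variants and extensions, so it is not a full BCP 47 parser. The function name split_tag is made up for illustration.

```python
import re

# Simplified pattern: language, optional script, optional private-use section.
# Assumption: regions, variants and extensions are out of scope for this sketch.
_TAG = re.compile(
    r"^(?P<language>[A-Za-z]{2,3})"
    r"(?:-(?P<script>[A-Za-z]{4}))?"
    r"(?:-x(?P<private>(?:-[A-Za-z0-9]{1,8})+))?$"
)


def split_tag(tag):
    """Split a tag such as 'sa-Latn-x-ITRANS' into its parts, or return None."""
    match = _TAG.match(tag)
    if match is None:
        return None
    script = match.group("script")
    private = match.group("private")
    return {
        # Language tags are case-insensitive; normalise to the usual casing.
        "language": match.group("language").lower(),
        "script": script.title() if script else None,
        "private": private.strip("-").lower().split("-") if private else [],
    }


print(split_tag("sa-Latn-x-ITRANS"))
print(split_tag("sa-Latn-x-IAST"))
```

Anything after the "x" is your own convention, so a consumer of the page has to know in advance what "itrans" or "iast" means there.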

A good reference/starting point seems to be https://www.w3.org/International/articles/language-tags/ 

--

ShreeDevi Kumar

Apr 10, 2016, 12:30:04 AM
to sanskrit-p...@googlegroups.com
Thank you, Shreevatsa.

Actually, all the linked pages are Sanskrit text transliterated into various Indian scripts and Roman transliteration, so the language codes should be:

sa-Latn for IAST/ISO 
sa-Deva for Devanagari
sa-Beng for Bengali
sa-Gujr for Gujarati
sa-Guru for Gurmukhi - Panjabi
sa-Knda for Kannada
sa-Mlym for Malayalam
sa-Orya for Oriya/Odia
sa-Taml for Tamil
sa-Telu for Telugu
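As a rough sketch of how the list above could be applied when generating the link tags: the Python below maps the ?lang= values used on the pages to language + script tags and builds one alternate link. The pairing of each ?lang= value with a script (ta with Taml, gu with Gujr, and so on) is my assumption from the script names, and the function names are made up for illustration.

```python
from html import escape

# Assumed mapping from the site's ?lang= values to ISO 15924 script codes,
# following the list above.
SCRIPT_FOR_LANG_PARAM = {
    "bn": "Beng",
    "gu": "Gujr",
    "pa": "Guru",
    "kn": "Knda",
    "ml": "Mlym",
    "or": "Orya",
    "ta": "Taml",
    "te": "Telu",
}


def sanskrit_tag(lang_param):
    """Return e.g. 'sa-Gujr' for 'gu', or None if the value is not in the table."""
    script = SCRIPT_FOR_LANG_PARAM.get(lang_param)
    return f"sa-{script}" if script else None


def alternate_link(page_url, lang_param):
    """Build one <link rel="alternate"> element for a script version of a page."""
    tag = sanskrit_tag(lang_param)
    if tag is None:
        raise ValueError(f"no script known for lang={lang_param!r}")
    href = escape(f"{page_url}?lang={lang_param}", quote=True)
    return f'<link rel="alternate" href="{href}" hreflang="{tag}">'


# example.org stands in for the real site.
print(alternate_link("http://example.org/doc.html", "ta"))
```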

That still leaves the problem of ITRANS transliteration....

Maybe 'Sanskrit Programmers' group can register a variant with IANA registry - similar to https://www.iana.org/assignments/lang-subtags-templates/alalc97.txt
for ALA-LC Romanization, 1997 edition

In fact, initially I had set up the pages to use sa-Script format tags, but Google Webmaster Tools gave some errors.

Even now, I am getting errors related to return tags, even though each page has all the links ..

International Targeting | Language > 'sa' - no return tags
URLs for your site and alternate URLs in 'sa' that do not have return tags.

Maybe these are still not being used by the browsers and search engines in a meaningful way yet.
 





ShreeDevi
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com


ShreeDevi Kumar

Apr 10, 2016, 4:30:29 AM
to sanskrit-p...@googlegroups.com

Some variant subtags denote a particular variant of a system of writing or transliteration. For example, zh-Latn-wadegile is Chinese written in the Latin alphabet, according to the transliteration system developed by Thomas Wade and Herbert Giles; ja-Latn-hepburn is Japanese written in the Latin alphabet using the transliteration system of James Curtis Hepburn.

So, sa-Latn-ITRANS could also be used instead of sa-Latn-x-ITRANS


ShreeDevi

Shreevatsa R

Apr 10, 2016, 9:43:27 AM
to sanskrit-programmers
zh-Latn-wadegile and ja-Latn-hepburn are valid only because they are in the IANA Subtag Registry, as

Type: variant
Subtag: wadegile
Description: Wade-Giles romanization
Added: 2008-10-03
Prefix: zh-Latn

and

Type: variant
Subtag: hepburn
Description: Hepburn romanization
Added: 2009-10-01
Prefix: ja-Latn

The language tag sa-Latn-ITRANS would be invalid, because "ITRANS" isn't present in the registry as a variant.
Again, you can check this by entering it at https://r12a.github.io/apps/subtags/ in the "Check" box.
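The same check can be sketched in a few lines of Python. The excerpt below contains only the two records quoted above, separated by "%%" as in the registry file; a real check would load the full IANA registry. The function names are made up for illustration, and the sketch assumes the variant is the last subtag and that each record has a single Prefix field (real records can have several).

```python
# Two records copied from the post above; "%%" separates records in the registry file.
REGISTRY_EXCERPT = """\
Type: variant
Subtag: wadegile
Description: Wade-Giles romanization
Added: 2008-10-03
Prefix: zh-Latn
%%
Type: variant
Subtag: hepburn
Description: Hepburn romanization
Added: 2009-10-01
Prefix: ja-Latn
"""


def parse_records(text):
    """Turn 'Key: value' blocks separated by '%%' into a list of dicts."""
    records = []
    for chunk in text.split("%%"):
        record = {}
        for line in chunk.strip().splitlines():
            key, _, value = line.partition(": ")
            record[key] = value
        if record:
            records.append(record)
    return records


def variant_is_registered(tag, records):
    """True if the last subtag of `tag` is a registered variant for its prefix."""
    prefix, _, variant = tag.rpartition("-")
    for record in records:
        if (
            record.get("Type") == "variant"
            and record.get("Subtag", "").lower() == variant.lower()
            and record.get("Prefix", "").lower() == prefix.lower()
        ):
            return True
    return False


records = parse_records(REGISTRY_EXCERPT)
for tag in ("zh-Latn-wadegile", "ja-Latn-hepburn", "sa-Latn-ITRANS"):
    print(tag, variant_is_registered(tag, records))
```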

Stepping back a bit for the broader picture, we can ask about the purpose of the language tag. I can imagine two categories of reasons for including the tag:

- Idealist/semantic: The W3C guidelines recommend it, it is a standard, it adds to the semantic understanding of the page, in principle it can help with accessibility (e.g. screen readers that see "en-GB" can switch to a British accent), in principle it helps the browser to have accurate information and it might become useful someday in the future, etc.

- Realist/pragmatic: It makes the user's experience better in some concrete way, e.g. their browser picks better fonts if it knows the script of the content, instead of treating it as English (Latin script) and using fallback fonts for glyphs not present in the font. But unfortunately there is not much evidence of this: today's browsers are very unlikely to know what to do even with sa-Latn (the first page of Google search results for "sa-Latn" contains two open Mozilla bugs, one of which I reported in 2009, and another which has been open since 2003), let alone sa-Gujr and the like.

Even the official W3 article on "Why use the language attribute?" https://www.w3.org/International/questions/qa-lang-why.en isn't very convincing, resting more on the hope of future browser support. It mentions styling, but that is equally well achieved by adding classes directly. It mentions hyphenation, line-breaking, justification, case, spelling and grammar checkers, and screen readers, but there's no browser support for any of these for Sanskrit. It mentions font selection, but Firefox at least goes crazy when it sees "sa-Latn" and doesn't even treat it as Latin-script text (and even ignores explicit styling!).

So if your reasons are in the former category then you ought to use standards-compliant (BCP47) tags, like sa-Gujr or sa-Latn-x-ITRANS. If your reasons are in the latter category, then unfortunately, browser support being what it is, my pessimistic estimate is that you might as well do the "wrong" thing by using "gu" or leaving it out entirely, and it makes not much difference. I guess one concrete way in which using the correct tags can be useful is if you have a global stylesheet on the site that selects for text in sa-Gujr (say) and applies appropriate style to it.

ShreeDevi Kumar

Apr 10, 2016, 10:52:50 PM
to sanskrit-p...@googlegroups.com
Very well explained, Shreevatsa.

My reasons for using the language tags were the hope that the Sanskrit text in Indian regional scripts would get picked up by the search engines as well as trying to be compliant with the internationalization standards.

For now, I will leave the language tags as they are.

I figured out the error "URLs for your site and alternate URLs in 'sa' that do not have return tags.":

The URLs without the ?lang= ending (e.g. http://sanskritdocuments.org/doc_devii/durgApancharatnam.html ) display the webpage in Devanagari, but they are not included in the list of alternate URLs.
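A small Python sketch of the return-tag rule as I understand it: if page A declares page B as an alternate, B has to declare A in return. The data structure and the function name are hypothetical, and example.org stands in for the real site.

```python
def missing_return_tags(alternates):
    """Find (page, target) pairs where target does not link back to page.

    `alternates` maps each page URL to the set of URLs it declares
    in its <link rel="alternate"> tags.
    """
    missing = []
    for page, targets in alternates.items():
        for target in targets:
            if target == page:
                continue  # a self-reference needs no return tag
            if page not in alternates.get(target, set()):
                missing.append((page, target))
    return sorted(missing)


bare = "http://example.org/doc.html"
sa = bare + "?lang=sa"
gu = bare + "?lang=gu"

# Every version carries the same link block, but the block never lists the bare URL.
declared = {bare: {sa, gu}, sa: {sa, gu}, gu: {sa, gu}}
for page, target in missing_return_tags(declared):
    print(f"{target} has no return tag for {page}")
```

In this toy setup the bare URL is reported twice, once per alternate, which matches the situation described above; adding the bare URL to the link block makes the list come back empty.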

> Firefox at least goes crazy when it sees "sa-Latn" and doesn't even treat it as Latin-script text (and even ignores explicit styling!).

Does Firefox display the ISO/IAST pages with sa-Latn on sanskritdocuments correctly with diacritics?

If not, that may need to be fixed. Please let me know the problem with the pages on Firefox.



 

ShreeDevi

Nikhilesh Jasuja

Apr 11, 2016, 4:48:52 PM
to sanskrit-programmers
I wouldn't leave the language tags as they are. Like you said in a previous post, the tags should be sa-Beng, sa-Gujr etc. rather than simply bn and gu. That is the correct representation of the content that's on the page.

I'd also not hijack the links to the other script versions on one version of the page. Currently, you change the content on the page when a user clicks on, say, Gujarati, but you do that via JavaScript, which means the page URL does not change to ?lang=gu. You can leave it as a simple link, unintercepted by JavaScript. Or -- if you must use JavaScript -- you can use history.pushState() to change the URL.

ShreeDevi Kumar

Apr 11, 2016, 11:39:39 PM
to sanskrit-p...@googlegroups.com
Nikhilesh,

All the webpages are in Devanagari only. It is the JavaScript (sanscript) that changes the script of the Sanskrit content.


I will check with the person who set up the transliterator script about changing the URL using history.pushState()

Regarding the language tags, it would require changes to a large number of webpages and may lead to broken links, so I will do that only if necessary.

Right now, another problem with rendering has cropped up when Vedic accents and superscripts are used with text in Indian scripts other than Devanagari (it was working fine earlier), e.g. http://sanskritdocuments.org/doc_veda/narayanasukta.html?lang=ta or http://sanskritdocuments.org/doc_veda/narayanasukta.html?lang=gu

These are probably being marked as illegal combinations by rendering engines.



ShreeDevi

--

ShreeDevi Kumar

Apr 12, 2016, 12:17:12 PM
to sanskrit-p...@googlegroups.com
which has the suggested changes for hreflang tag and change in URL etc.

However still getting errors when I test with
or
or

ShreeDevi

Nikhilesh Jasuja

Apr 12, 2016, 12:29:14 PM
to sanskrit-p...@googlegroups.com
The problem is sa-Latn-x-ITRANS. It's not a valid hreflang code. See here what a valid hreflang code is: https://hreflang.org/what-is-a-valid-hreflang/

You can use just the language (e.g. "sa"), or language + script (e.g. "sa-Gujr"), or language + region (e.g. "en-US"), or language+region+script. In your case, all your content is in the same language (sa) but different scripts. 

So you'll be all set if you get rid of the page http://sanskritdocuments.org/pfrs/test-sukta.html?lang=en (which, by the way, craps out on me in Chrome)


---

Shreevatsa R

Apr 12, 2016, 1:10:09 PM
to sanskrit-programmers
Nikhilesh,

On what basis do you declare sa-Latn-x-ITRANS as an invalid hreflang code? As I pointed out earlier, the format of valid language tags is broader than what you said:

By this measure sa-Latn-x-ITRANS is a perfectly valid language tag, and is recognized as such by the tool at https://r12a.github.io/apps/subtags/ (linked from the w3.org page).

I think it's a bug in hreflang.org (and the other sites) that they don't recognize the tag as valid.

Again, to use my two terms from earlier:
- The idealist/semantic position is that sa-Latn-x-ITRANS is valid, because the standard says it's valid.
- The realist/pragmatic position would be sa-Latn-x-ITRANS is valid if the browsers you care about (recent versions of Chrome/Firefox/IE?) recognize it as valid, *not* whether random sites that poorly implement the standard recognize it as valid.


ShreeDevi Kumar

Apr 12, 2016, 1:38:52 PM
to sanskrit-p...@googlegroups.com
Thanks, I changed the hreflang tag for ITRANS text to sa-Latn-IN and that works :-)


The hreflang tag testing tool at https://app.hreflang.org supports the lang-script option, so the page now 'passes'.

I changed the links from relative URLs to absolute URLs.

The romanized (ISO/IAST) Sanskrit is marked as sa-Latn; ITRANS is marked as sa-Latn-IN.
Devanagari Sanskrit text is available in two flavors: one that uses a font with a larger number of ligatures, marked as sa, and a simpler version marked as sa-Deva.

Here is the full list ..

<link rel="alternate" href="http://sanskritdocuments.org/pfrs/test-sukta.html"  hreflang="x-default" >
<link rel="alternate" href="http://sanskritdocuments.org/pfrs/test-sukta.html?lang=sa"  hreflang="sa" >
<link rel="alternate" href="http://sanskritdocuments.org/pfrs/test-sukta.html?lang=hi"  hreflang="sa-Deva" >
<link rel="alternate" href="http://sanskritdocuments.org/pfrs/test-sukta.html?lang=sa-Latn"  hreflang="sa-Latn" >
<link rel="alternate" href="http://sanskritdocuments.org/pfrs/test-sukta.html?lang=en-IN"  hreflang="sa-Latn-IN" >
<link rel="alternate" href="http://sanskritdocuments.org/pfrs/test-sukta.html?lang=ta"  hreflang="sa-Taml" >
<link rel="alternate" href="http://sanskritdocuments.org/pfrs/test-sukta.html?lang=kn"  hreflang="sa-Knda" >
<link rel="alternate" href="http://sanskritdocuments.org/pfrs/test-sukta.html?lang=te"  hreflang="sa-Telu" >
<link rel="alternate" href="http://sanskritdocuments.org/pfrs/test-sukta.html?lang=ml"  hreflang="sa-Mlym" >
<link rel="alternate" href="http://sanskritdocuments.org/pfrs/test-sukta.html?lang=or"  hreflang="sa-Orya" >
<link rel="alternate" href="http://sanskritdocuments.org/pfrs/test-sukta.html?lang=gu"  hreflang="sa-Gujr" >
<link rel="alternate" href="http://sanskritdocuments.org/pfrs/test-sukta.html?lang=bn"  hreflang="sa-Beng" >
<link rel="alternate" href="http://sanskritdocuments.org/pfrs/test-sukta.html?lang=pa"  hreflang="sa-Guru" >

-------------------------

Nikhilesh, the test-sukta.html page with ?lang=en displays the ITRANS text with the Vedic accents. The webpage uses webfonts, so maybe there was a delay in rendering of the page. Please check again. It shows up OK on Chrome on Windows 10 - screenshot attached.

Inline image 1

ShreeDevi

Bhasha IME

Apr 12, 2016, 2:01:17 PM
to sanskrit-p...@googlegroups.com
namaste ShreeDevi Kumar

The Kannada rendering seems to use the Goda font. If so, the latest version is 1.0.5 on the bashaime site. It has slightly improved positioning of svaras.

There is a problem here. Observe
ವಿ॒ಶ್ವಶ॑ಂಭುವಂ. is not rendering properly.
The svarita should occur after the anusvara. It is so in many places down the doc.
FYKN
regards
Venkatesh


ShreeDevi Kumar

Apr 12, 2016, 11:48:33 PM
to sanskrit-p...@googlegroups.com
Thanks for letting me know about the new version of Goda font. I will make the change. Is there a webfont version available too?

If you know of fonts for other Indian scripts that support the Vedic accents correctly, please let me know.

Regards,

ShreeDevi

ShreeDevi Kumar

Apr 13, 2016, 12:42:29 AM
to sanskrit-p...@googlegroups.com, bhas...@gmail.com
The download files for 1.0.5.7z for Goda and Vagisha fonts seem to have viruses as per Windows Defender. Please check and re-upload. Thanks!

ShreeDevi


Bhasha IME

Apr 13, 2016, 4:52:34 AM
to sanskrit-p...@googlegroups.com, Shree Devi Kumar
Obviously a false positive.

The file does not have any exe/com at all.

Also uploaded individual font files, for convenience.

From Virustotal:
SHA256: aa3b793431d8796ee01a848851422a3caa9c594b60821076bd6ac26260a96c02
File name: Release_Vagisha_1_0_5.7z
Detection ratio: 0 / 55
Analysis date: 2016-04-13 08:02:49 UTC ( 1 minute ago )


From Virustotal:
SHA256: 3f895ae9c78f99a821ae398236895216f1cdf15f682d9c7be30d202766fc3ed3
File name: Release_Goda_1_0_5.7z
Detection ratio: 0 / 56
Analysis date: 2016-04-13 08:05:36 UTC ( 0 minutes ago )

All files are safe.
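If anyone wants to confirm that a downloaded archive is the same file that was scanned, one option is to compare its SHA-256 with the value listed above. A minimal Python sketch using only the standard library follows; the helper names are made up, and the local file path in the commented usage is an assumption about where the download was saved.

```python
import hashlib
import io


def sha256_hex(stream, chunk_size=65536):
    """Return the hex SHA-256 of a binary stream, read in chunks."""
    digest = hashlib.sha256()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(stream, published_hex):
    """Compare a stream's SHA-256 with a published hex value, ignoring case."""
    return sha256_hex(stream) == published_hex.strip().lower()


# Hash listed above for Release_Goda_1_0_5.7z.
GODA_SHA256 = "3f895ae9c78f99a821ae398236895216f1cdf15f682d9c7be30d202766fc3ed3"

# Usage with the real download (assumes the file is in the current directory):
# with open("Release_Goda_1_0_5.7z", "rb") as archive:
#     print(matches_published_hash(archive, GODA_SHA256))

# In-memory demonstration with arbitrary bytes, which will not match.
print(matches_published_hash(io.BytesIO(b"not the archive"), GODA_SHA256))
```

A matching hash only shows the file is identical to the one scanned; it says nothing beyond what the scan itself reported.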

regards
Venkatesh




Nikhilesh Jasuja

Apr 14, 2016, 10:03:52 PM
to sanskrit-p...@googlegroups.com
Shreevatsa,

Hreflang is for search engines, not browsers. Google came up with Hreflang to allow webmasters to indicate to search engines where translated content was available. Google's man page on Hreflang (https://support.google.com/webmasters/answer/189077?hl=en ) does NOT say that it follows the same format as the lang attribute on the <html> tag. The Hreflang testing tools out there test for what values search engines will find "valid", because that's the only place hreflang is actually used.

Shree Devi Kumar,
The test page with ITRANS is not working properly on Chrome/Mac. Screenshot here: http://imgur.com/gFJvcy0 


---

Shreevatsa R

Apr 14, 2016, 10:07:48 PM
to sanskrit-programmers
Thanks Nikhilesh! 
Sorry I mistakenly assumed that hreflang was the same as the HTML lang attribute. I didn't realize that its purpose was for Google; things are clearer now... thanks again :-) 


ShreeDevi Kumar

Apr 15, 2016, 5:07:48 AM
to sanskrit-p...@googlegroups.com
> Shree Devi Kumar,
> The test page with ITRANS is not working properly on Chrome/Mac. Screenshot here: http://imgur.com/gFJvcy0

Thank you. I have changed the CSS to use a different font. Please check that it is OK now.

Nikhilesh Jasuja

Apr 15, 2016, 10:00:41 AM
to sanskrit-p...@googlegroups.com
Looks good now.

---
www.diffen.com
Diffen. Discern. Decide.
