Proposal: Locale encoding in URLs

Arun Prasad

Sep 12, 2022, 8:40:24 PM
to ambuda-discuss
What is this?
A proposal for how to encode locales in Ambuda's URLs. I'm sharing it for feedback from the group.


Why do this?

We want to make Ambuda available in a variety of different languages. (More technically, we could call these locales, each of which refers to a language-region pair. But for our purposes, a language and a locale are usually synonymous.)

It's useful to encode a locale in a page's URL. By doing so, we:

- give our users stable URLs that they can share with other people in their language community.
- help search engines index multi-lingual content.
- better indicate to users that our site is multi-lingual.
- improve our site's usability by giving users an obvious way to change the site language.


What options are there?

Broadly, there are four ways to encode this information in the URL:
1. By using country-specific top-level domains (google.de, google.fr, ...)
2. By using subdomains (en.duolingo.com, fr.duolingo.com, ...)
3. By using subdirectories (apple.com/en, apple.com/fr, ...)
4. By using URL parameters or session cookies (ambuda.org?lang=fr, ...)

(1) is expensive since we need to buy each domain separately. It also doesn't fit well for India, where a single .in country code has at least dozens of associated languages. (4) is discouraged by Google, and it is much clumsier technically than (2) and (3).


Why I think we should use subdirectories
I prefer subdomains (en.ambuda.org), but I think we should use subdirectories (ambuda.org/en).

I prefer subdomains for a variety of reasons:
- They display nicely on mobile browsers: the user knows that they're on en.ambuda.org specifically.
- They make the URL more navigable for common use cases. A user who wants to return to the main page can just replace the URL path with `/` as opposed to `/en/`.
- They don't have an SEO penalty. Google SEO is flexible if we provide the right sitemap, and they recommend subdomains along with subdirectories in their docs.
- I think they look nicer.

But here's why I think we should use subdirectories:
- They're much simpler to support in the dev environment.
- They have a 1:1 mapping to subdomains, so it's easy to convert one structure to another if we ever choose to.
- That migration would be cheap: we can just add redirects to the site.
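
To make the 1:1 mapping concrete, here is a minimal sketch in Python of converting between the two URL structures. The function names and the `LOCALES` set are hypothetical and only for illustration; nothing here is Ambuda's actual code.

```python
from urllib.parse import urlsplit, urlunsplit

BASE_HOST = "ambuda.org"
LOCALES = {"en", "sa", "hi"}  # hypothetical set of supported locale codes


def subdir_to_subdomain(url: str) -> str:
    """Map https://ambuda.org/en/texts/ to https://en.ambuda.org/texts/."""
    parts = urlsplit(url)
    segments = parts.path.split("/")  # "/en/texts/" -> ["", "en", "texts", ""]
    if len(segments) < 2 or segments[1] not in LOCALES:
        return url  # no locale prefix, leave the URL alone
    locale = segments[1]
    rest = "/" + "/".join(segments[2:])
    host = f"{locale}.{BASE_HOST}"
    return urlunsplit((parts.scheme, host, rest, parts.query, parts.fragment))


def subdomain_to_subdir(url: str) -> str:
    """Map https://en.ambuda.org/texts/ to https://ambuda.org/en/texts/."""
    parts = urlsplit(url)
    locale, _, base = parts.netloc.partition(".")
    if locale not in LOCALES or base != BASE_HOST:
        return url  # not a locale subdomain, leave the URL alone
    path = f"/{locale}{parts.path or '/'}"
    return urlunsplit((parts.scheme, BASE_HOST, path, parts.query, parts.fragment))
```

A redirect layer for a future migration could apply one of these functions to each incoming request URL.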


Proposal

I think ambuda.org/ should be a language selection page analogous to www.wikipedia.org. All content in English (for example) will be under ambuda.org/en/. So that means we have ambuda.org/en/proofing, ambuda.org/en/texts/, etc.
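
As a rough sketch of how this scheme could be interpreted on the server, the function below splits a request path into a locale and the rest of the path. A result with no locale would mean "show the language selection page". The name `split_locale` and the `SUPPORTED_LOCALES` tuple are assumptions for illustration only.

```python
SUPPORTED_LOCALES = ("sa", "en")  # hypothetical; the real list would be longer


def split_locale(path: str):
    """Return (locale, remainder), or (None, path) if there is no locale prefix."""
    head, _, tail = path.lstrip("/").partition("/")
    if head in SUPPORTED_LOCALES:
        return head, "/" + tail
    return None, path
```

With this, `/en/proofing` resolves to the English proofing page, while `/` has no locale and falls through to the splash page.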

The splash page can prioritize Sanskrit and English. We can show other languages in their alphabetical order as done on a site like Wikipedia.
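
A minimal sketch of that ordering, assuming languages are keyed by code and sorted by an English display name (sorting by native name would be a different choice). The names `PRIORITY` and `splash_order` are hypothetical.

```python
PRIORITY = ["sa", "en"]  # Sanskrit and English are pinned to the top


def splash_order(languages: dict[str, str]) -> list[str]:
    """Order language codes for the splash page.

    `languages` maps a language code to its display name.
    """
    pinned = [code for code in PRIORITY if code in languages]
    rest = sorted(
        (code for code in languages if code not in PRIORITY),
        key=lambda code: languages[code].casefold(),
    )
    return pinned + rest
```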

Other schemes I considered:
- We could list all languages alphabetically, but some languages are going to be much more common than others. In particular, a simple alphabetical sort will list all Western languages before all Indian ones, which is poor UX given that roughly 60% of our users are from India.

- We could split languages into two categories ("Indian" and "International"), but that raises further questions. For example, Urdu is a scheduled language in India, but it is also the national language of Pakistan. With an Indian/International scheme, it's not obvious how to list Urdu.

Arun Prasad

Sep 12, 2022, 8:44:49 PM
to ambuda-discuss
Oh, one thing I should clarify is that we specifically want to support search in multiple scripts, since Google doesn't recognize that the same text in two different scripts is the same content. That is, if you search Google for Sanskrit text in Telugu script, you'll get no Devanagari results.

Vishvas Vasuki (Vishvas)

Sep 12, 2022, 9:37:54 PM
to ambuda-discuss
On Tuesday, 13 September, 2022 at 6:14:49 am UTC+5:30 Arun Prasad wrote:
Oh, one thing I should clarify is that we specifically want to support search in multiple scripts, since Google doesn't recognize that the same text in two different scripts is the same content. That is, if you search Google for Sanskrit text in Telugu script, you'll get no Devanagari results.

The language-script distinction is important to make. Many people prefer reading Sanskrit in different scripts, yes. But this is the case with several other languages too. (I prefer to read tamiL and maNipravALa in scripts I'm familiar with.) Furthermore, Ambuda is available in multiple *languages*, not just scripts. Of course, one can have the concept of a default script, so that specifying just the language code should suffice in most cases.
 
So, subdomains or subdirs - I strongly suggest that you go for a standard language-script pairing. That is, a language (in ISO 639-1 format), optionally a region (in ISO 3166-1 alpha-2 format), and the script itself given explicitly using ISO 15924.
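
To illustrate the pairing, here is a small sketch that builds and parses tags in language-script-region order, dropping the script when it is the language's default. The `DEFAULT_SCRIPT` table and both function names are assumptions for illustration, and the parser ignores extension subtags rather than handling them fully.

```python
DEFAULT_SCRIPT = {"sa": "Deva", "en": "Latn"}  # hypothetical defaults


def make_tag(language, script=None, region=None):
    """Build a tag such as 'sa-Knda' or 'en-IN'; omit a default script."""
    language = language.lower()
    parts = [language]
    if script and script.title() != DEFAULT_SCRIPT.get(language):
        parts.append(script.title())
    if region:
        parts.append(region.upper())
    return "-".join(parts)


def parse_tag(tag):
    """Return (language, script, region), filling in the default script."""
    language, *rest = tag.split("-")
    language = language.lower()
    script = region = None
    for sub in rest:
        if len(sub) == 1:
            break  # singleton starts an extension (e.g. '-t-'); stop here
        if len(sub) == 4 and sub.isalpha():
            script = sub.title()  # ISO 15924 codes are four letters
        elif len(sub) == 2 and sub.isalpha():
            region = sub.upper()  # ISO 3166-1 alpha-2 codes are two letters
    if script is None:
        script = DEFAULT_SCRIPT.get(language)
    return language, script, region
```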

As an aside, if you visit view-source:https://vishvasa.github.io/purANam/mahAbhAratam/goraxapura-pAThaH/hindy-anuvAdaH/01_Adiparva/02_parvasaMgrahaparva/002_parvasangrahaparva/ , you will see several hreflang entries, for example:


<link rel="alternate" hreflang="sa-Knda" href="https://vishvAsa.github.io/purANam/mahAbhAratam/goraxapura-pAThaH/hindy-anuvAdaH/01_Adiparva/02_parvasaMgrahaparva/002_parvasangrahaparva/?transliteration_target=kannada" />
<link rel="alternate" hreflang="sa-Mlym" href="https://vishvAsa.github.io/purANam/mahAbhAratam/goraxapura-pAThaH/hindy-anuvAdaH/01_Adiparva/02_parvasaMgrahaparva/002_parvasangrahaparva/?transliteration_target=malayalam" />
<link rel="alternate" hreflang="sa-Telu" href="https://vishvAsa.github.io/purANam/mahAbhAratam/goraxapura-pAThaH/hindy-anuvAdaH/01_Adiparva/02_parvasaMgrahaparva/002_parvasangrahaparva/?transliteration_target=telugu" />
<link rel="alternate" hreflang="sa-Taml-t-sa-Taml-m0-superscript" href="https://vishvAsa.github.io/purANam/mahAbhAratam/goraxapura-pAThaH/hindy-anuvAdaH/01_Adiparva/02_parvasaMgrahaparva/002_parvasangrahaparva/?transliteration_target=tamil_superscripted" />
<link rel="alternate" hreflang="sa-Taml" href="https://vishvAsa.github.io/purANam/mahAbhAratam/goraxapura-pAThaH/hindy-anuvAdaH/01_Adiparva/02_parvasaMgrahaparva/002_parvasangrahaparva/?transliteration_target=tamil" />
<link rel="alternate" hreflang="sa-Gran" href="https://vishvAsa.github.io/purANam/mahAbhAratam/goraxapura-pAThaH/hindy-anuvAdaH/01_Adiparva/02_parvasaMgrahaparva/002_parvasangrahaparva/?transliteration_target=grantha" />

They all failed (notwithstanding https://developers.google.com/search/docs/advanced/crawling/localized-versions?visit_id=637986294738661079-3517741230&rd=1 ), because JavaScript does the transliteration, and Google doesn't care to index them properly ("They're all duplicates," it says). So all translation and transliteration for Ambuda should happen server-side.
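
For comparison, a server-side setup could emit the same kind of hreflang links but point each one at its own path rather than at a query parameter. The sketch below assumes a path layout of `<base>/<tag>/<page path>`, which is only one possible reading of the subdirectory proposal; `hreflang_links` is a hypothetical helper.

```python
from html import escape


def hreflang_links(base_url, path, tags):
    """Render one <link rel="alternate"> element per language-script tag."""
    links = []
    for tag in tags:
        href = f"{base_url}/{tag}{path}"  # assumed URL layout
        links.append(
            f'<link rel="alternate" hreflang="{escape(tag)}" href="{escape(href)}" />'
        )
    return "\n".join(links)
```

Each generated URL would need to return content already rendered in that script, so that a crawler sees distinct pages without running any JavaScript.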


As a user, I strongly prefer subdomains. Editing a URL on a phone is less hairy that way, I think.

विश्वासो वासुकिजः (Vishvas Vasuki)

Sep 14, 2022, 11:38:53 AM
to ambuda-discuss
A tangentially related link: https://unicode-org.atlassian.net/browse/CLDR-13444

--
Vishvas /विश्वासः
