I can think of:
1) Percentage of the Internet population covered in native language
(this is what my spreadsheet attempts to do)
2) Percentage of the world population covered in native language
(this doesn't take into account internet penetration, but you could
say "hey, they have a browser waiting for them when they get online")
3) Percentage of the world/internet population covered by any language
they speak
(Not sure where you'd find the right figures to do this)
4) Number of languages covered
(this is what counting packs does, although it runs into trouble with
the definition of "language", and the fact that you give a language
with 10,000 speakers the same weight as one with 100,000,000)
5) Percentage of countries covered, by official language
(This might be a good proxy for method 3, because you would hope that
everyone in a country speaks at least one of the official languages)
Can anyone think of any more?
Gerv
Where do you get reliable figures for internet and/or world population?
What's the use of such figures? What use is it to know that the
Sorbian languages cover 0.01% of the world population, if I assume 6
billion people? What's the advantage of knowing that? IMHO those figures
will be meaningless; they will be too vague.
>
> 4) Number of languages covered
> (this is what counting packs does, although it runs into trouble with
> the definition of "language", and the fact that you give a language
> with 10,000 speakers the same weight as one with 100,000,000)
>
> 5) Percentage of countries covered, by official language
> (This might be a good proxy for method 3, because you would hope that
> everyone in a country speaks at least one of the official languages)
You should combine the number of languages with the number of locales, to
include varieties of a language.
Why percentage? A percentage has statistical significance only, without
real use. Better is the number of covered and not covered countries,
combined with locales, because in a lot of countries more than
one language is spoken.
Michael
> Where do you get [reliable] figures for internet and/or world population?
The blog message referenced in the previous subject ("Language Analysis
for FF3") mentions that the figures are taken from CIA World Factbook.
Probably this page: Internet users (by country)
https://www.cia.gov/library/publications/the-world-factbook/fields/2153.html
> What's the use of such figures? What use is it to know that the
> Sorbian languages cover 0.01% of the world population, if I assume 6
> billion people? What's the advantage of knowing that? IMHO those figures
> will be meaningless; they will be too vague.
It is one way to get a rough estimate of which populations are underserved
and represent possible opportunities, to help prioritize future
community-building efforts. (Mozilla Manifesto says Mozilla works for
public benefit; the public includes all populations of the world.)
> Why percentage? A percentage has statistical significance only, without
> real use.
Firefox gets free publicity in the news when it gains market share
percentage points. One way these figures can be used is to find
opportunities where it may gain most market share.
> Better is the number of covered countries and not covered
> countries combined with locales because in a lot of countries more than
> 1 language is spoken.
Better for what purpose?
Maybe a marketing-oriented person likes comparing feature checklists.
Locale checklists without populations seem simpler in that case. But it
doesn't provide much guidance on how to focus resources to fill the
empty check-boxes.
One possible fear: The Mozilla community has limited human resources to
reach out and help support new localization communities, so for the
purpose of allocating time to candidate communities, one consideration
is population. This doesn't mean that the long tail of smaller
communities is to be excluded, but smaller communities may get less
attention.
(One way to allocate is by population: if say five 1% locales are
waiting for reviews and two 2.5% locales don't have localizations yet,
then maybe roughly half the person-hours could be spent on the reviews
and half the person-hours spent on helping a localization community get
started in the underserved locales.)
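As a rough illustration of the proportional split described above, this Python sketch divides a budget of person-hours between groups of locales according to the population share each group represents. The function name, the group labels and the 40-hour budget are assumptions made up for the example; only the 1% and 2.5% shares come from the message.

```python
def allocate_hours(groups, total_hours):
    """Split total_hours across groups in proportion to population share.

    groups maps a label to a list of population shares (in percent),
    one entry per locale in that group.
    """
    group_totals = {label: sum(shares) for label, shares in groups.items()}
    overall = sum(group_totals.values())
    return {
        label: total_hours * subtotal / overall
        for label, subtotal in group_totals.items()
    }


# Shares from the example in the message; the 40-hour budget is hypothetical.
groups = {
    "awaiting review": [1.0] * 5,      # five 1% locales
    "no localization yet": [2.5] * 2,  # two 2.5% locales
}
allocation = allocate_hours(groups, total_hours=40)
print(allocation)
```

Both groups add up to 5% of the population, so each gets half of the budget, matching the "roughly half" split in the text.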
I don't really think that getting more metrics is worthwhile. For status
reports on how our localization coverage is growing, the metrics should
yield more or less similar answers.
If you have more specific questions, it's probably a good idea to pick a
well-suited metric to answer that question, but that will likely be a
different metric for each.
Expect that the data we have might not help in getting answers to those
questions, independent of the metric we use. The data we have might be
just too vague, or changing.
And then there's still Churchill.
Axel
Both of those usually assume people have only one native language, which
applies to the majority of people but the rest might be significant when
looking at 2-10% of the (internet) population not being covered and
trying to find out about differences there.
Of course, counting gets hard when you want or need to respect people
with multiple native languages and still respect what their first choice
is between them.
Robert Kaiser
So you agree that the people you say have multiple native languages
still have a first choice? Then what's the problem? :-)
The last US census (I think) replaced the question about native language
with one about "language spoken at home". Which seems like a better
question, to which (for the vast majority of people) there is just one
answer. Can we in future assume that this is what we mean when we say
"native language"? :-)
Gerv
I think that's clearly not the case.
If we added support for Anfillo, Ainu, Amurdag, Alsatian, Abenaki and
Amanaye, then by the "number of language packs" metric, it would seem
that our coverage has grown by 10%. But by the "% population covered"
metric, it would hardly have grown at all.
(Source: http://en.wikipedia.org/wiki/List_of_endangered_languages)
> If you have more specific questions, it's probably a good idea to pick a
> well-suited metric to answer that question, but that will likely be a
> different metric for each.
Right. So that would be an argument for having good figures for several
different metrics.
> Expect that the data we have might not help in getting answers to those
> questions, independent of the metric we use. The data we have might be
> just too vague, or changing.
That's possible, but requires proof. If you have issues with the data
I'm using, please raise them :-)
Gerv
So "less similar" in this case. Btw, "Population covered" is not going
to change significantly anymore, "population not covered" might, though.
>> If you have more specific questions, it's probably a good idea to pick a
>> well-suited metric to answer that question, but that will likely be a
>> different metric for each.
>
> Right. So that would be an argument for having good figures for several
> different metrics.
>
>> Expect that the data we have might not help in getting answers to those
>> questions, independent of the metric we use. The data we have might be
>> just too vague, or changing.
>
> That's possible, but requires proof. If you have issues with the data
> I'm using, please raise them :-)
Do you have errors for your data? Time they were measured? Trends? And
errors in trends? If we had all that, take your metric, and do some
happy propagation of uncertainty,
http://en.wikipedia.org/wiki/Propagation_of_uncertainty.
In general, I'd expect the errors to be larger for smaller languages
than for big ones, bigger for poorer countries than for thoroughly
industrialized (or post-industrialized) ones.
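To make the propagation-of-uncertainty suggestion concrete, here is a minimal Python sketch for the simplest case: a coverage total built by adding per-language shares, with the errors assumed to be independent so that they add in quadrature. The function name and the two (share, error) pairs are hypothetical and are not taken from any data set discussed in this thread.

```python
import math


def sum_with_uncertainty(terms):
    """Add (value, absolute_error) pairs, assuming independent errors.

    For a plain sum, independent errors combine as the square root of
    the sum of their squares.
    """
    total = sum(value for value, _ in terms)
    sigma = math.sqrt(sum(error ** 2 for _, error in terms))
    return total, sigma


# Hypothetical shares and errors, both in percentage points.
terms = [(30.0, 3.0), (20.0, 4.0)]
total, sigma = sum_with_uncertainty(terms)
print(f"{total:.1f} +/- {sigma:.1f} percentage points")
```

If the errors were correlated (for instance, all taken from the same flawed census), this quadrature rule would understate the combined error.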
Axel
If we take for example the Thai language; there is no native language
build for F3 at mozilla.com. There is a http://www.firefoxthai.com/
for F2 (which uses the Firefox logos) but no indication for Firefox3.
What I would think is useful, is statistics that show the percentage
of the population of Thailand that speaks only Thai, and cannot find a
suitable Firefox 3.
For the purposes of L10n, it would be good to see which populations
(sorted by size) are affected by the lack of a version of F3
in any language they speak/read.
Simos
This is my method (3). The problem is that I don't know of a source of
data which can provide the necessary figures.
For a given country, you would need data something like this:
UK: All Languages Spoken:
English only: 47.3%
English and Bengali: 4.3%
Bengali only: 0.14%
English and Polish: 1.6%
Polish, Latvian and Russian: 0.006%
...
Then, you could look at the languages and go through ticking off the
groups for which you had at least one hit.
It would be a very long and very detailed list. I don't know if such
data is available even for countries with very good censuses.
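If such a table did exist, the "ticking off" step could be sketched in Python roughly as follows. The rows are the illustrative UK figures from the list above (they are not real census data, and the elided rows are simply left out), and the set of shipped localizations is a hypothetical choice for the example.

```python
def covered_share(groups, supported):
    """Sum the shares of all groups that speak at least one supported language.

    groups is a list of (set_of_languages, share_in_percent) pairs.
    """
    return sum(share for languages, share in groups if languages & supported)


# Illustrative figures from the message; the "..." rows are omitted,
# so these shares do not add up to 100.
uk_groups = [
    (frozenset({"English"}), 47.3),
    (frozenset({"English", "Bengali"}), 4.3),
    (frozenset({"Bengali"}), 0.14),
    (frozenset({"English", "Polish"}), 1.6),
    (frozenset({"Polish", "Latvian", "Russian"}), 0.006),
]

# Hypothetical set of languages with a localization.
supported = {"English", "Polish"}
print(round(covered_share(uk_groups, supported), 3))
```

A group counts as covered as soon as its language set intersects the supported set, which is exactly the "at least one hit" rule.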
Gerv
:-) "More or less similar" is an idiom; it doesn't mean "either more
similar or less similar", it means "quite similar".
> Btw, "Population covered" is not going
> to change significantly anymore, "population not covered" might, though.
I don't understand what you mean by that. If "population covered" +
"population not covered" = 100%, how can "population covered" not change
if "population not covered" changes?
> Do you have errors for your data? Time they were measured? Trends? And
> errors in trends? If we had all that, take your metric, and do some
> happy propagation of uncertainty,
> http://en.wikipedia.org/wiki/Propagation_of_uncertainty.
>
> In general, I'd expect the errors to be larger for smaller languages
> than for big ones,
But such errors have less effect on the overall result, because the
absolute numbers are smaller. If I say the population whose native
language is Alsatian is 10,000, when in fact it's 20,000, that's not
going to produce a noticeable error when the internet population is
about 1 billion.
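The size of that error can be checked with a few lines of Python. The sketch uses the figures from the paragraph above and, like the paragraph, compares the speaker count directly with the internet population; the function name is made up for the example.

```python
def share_error_points(assumed, actual, population):
    """Error in the coverage share, in percentage points."""
    return (actual - assumed) / population * 100


internet_population = 1_000_000_000  # "about 1 billion", from the message
error = share_error_points(
    assumed=10_000,   # figure used in the calculation
    actual=20_000,    # what the true figure might be
    population=internet_population,
)
print(f"{error:.4f} percentage points")
```

Even though the assumed figure is off by a factor of two, the overall percentage moves by only about a thousandth of a percentage point.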
> bigger for poorer countries than for thoroughly
> industrialized (or post-industrialized) ones.
You are probably right there.
But the question is: if your data is not perfect, do you just give up,
or do you work with the data you have? When I said "If you have issues",
I didn't mean "List all of the statistical problems it might have", I
meant "provide better data if you have some, otherwise let's go with
what we've got".
Gerv
For example, for Thailand, the page is
http://www.ethnologue.com/show_country.asp?name=TH
which says that the vast majority of the population (>93%) speaks Thai
or dialects of Thai.
Simos
To me, they're equivalent norms: when one grows, the other grows,
and vice versa.
>> Btw, "Population covered" is not going
>> to change significantly anymore, "population not covered" might, though.
>
> I don't understand what you mean by that. If "population covered" +
> "population not covered" = 100%, how can "population covered" not change
> if "population not covered" changes?
If you count in %, both have to change.
On the other hand, say we take 90% for one and 10% for the other: an
additional language with 2% changes one by a mere 2.2%, while it changes
the other by a whopping 20%.
Thus, relatively, covered population is hardly going to change by any
language we get, uncovered population on the other hand might.
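The arithmetic behind those two figures, as a small Python sketch (the function name is an assumption for illustration; the 90%, 10% and 2% values are the ones from the example above):

```python
def relative_change(before, delta):
    """Change of size delta, as a percentage of the starting value."""
    return delta / before * 100


covered, uncovered, new_language = 90.0, 10.0, 2.0
covered_change = relative_change(covered, new_language)
uncovered_change = relative_change(uncovered, new_language)
print(f"covered grows by {covered_change:.1f}%")
print(f"uncovered shrinks by {uncovered_change:.1f}%")
```

The same 2 percentage points are divided by 90 in one case and by 10 in the other, which is why the uncovered figure moves nine times as much in relative terms.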
>> Do you have errors for your data? Time they were measured? Trends? And
>> errors in trends? If we had all that, take your metric, and do some
>> happy propagation of uncertainty,
>> http://en.wikipedia.org/wiki/Propagation_of_uncertainty.
>>
>> In general, I'd expect the errors to be larger for smaller languages
>> than for big ones,
>
> But such errors have less effect on the overall result, because the
> absolute numbers are smaller. If I say the population whose native
> language is Alsatian is 10,000, when in fact it's 20,000, that's not
> going to produce a noticeable error when the internet population is
> about 1 billion.
It does when you start comparing it to other small languages. Which you
to some extent did in your blog post.
>> bigger for poorer countries than for thoroughly
>> industrialized (or post-industrialized) ones.
>
> You are probably right there.
>
> But the question is: if your data is not perfect, do you just give up,
> or do you work with the data you have? When I said "If you have issues",
> I didn't mean "List all of the statistical problems it might have", I
> meant "provide better data if you have some, otherwise let's go with
> what we've got".
Neither. You have to ask appropriate questions for your data to answer,
and if it's really noisy or uncertain data, you have to ask questions
that work well with fuzzy answers.
For example "Which language should Microsoft do next?" is likely a bogus
question, given that your answer was Balochi. That is probably affected
by each and every disclaimer you gave on the assumptions you made, and
has likely uncertain initial data, too. I'm running on the assumption
that the next runner up wasn't at just 0.10%.
Axel
The problem is that you can reach them equally well in all of those
languages, but they still prefer one. So if you ask "how many people do
we reach?" you only have to serve them any of those languages (e.g.
Sorbs, who all grow up bilingually). That e.g. means you can wipe out
minority languages whose native speakers are all bilingual from your
statistics (like Sorbian, or I guess also Gaelic), and you can wipe out
language variants, as Canadians, British, Irish, and South African
people probably can all be reached with en-US.
If you want to serve them in their preferred language, then you need to
look into variants as well as minority languages whose speakers are all
bilingual, and the picture gets both more difficult but also more
interesting.
So, the big problem is that you can't say "Microsoft doesn't reach
Sorbs" just because they don't offer Sorbian, as they reach them pretty
well with German (probably as well as one reaches Brits with en-US,
actually). What you can say is that we serve Sorbs better by offering
that language than by German, but we actually reach them with both.
> The last US census (I think) replaced the question about native language
> with one about "language spoken at home". Which seems like a better
> question, to which (for the vast majority of people) there is just one
> answer. Can we in future assume that this is what we mean when we say
> "native language"? :-)
So, you mean French _and_ German _and_ their Austrian dialect for a
friend of mine, as she speaks all three of them at home? ;-)
Robert Kaiser
That is, I think, because I worked that out by eye and I can't count.
The right answer is Belarusian.
Gerv
I think Axel and Pascal would eat me alive if I tried that.
> and you can wipe out
> language variants, as Canadians, British, Irish, and South African
> people probably can all be reached with en-US.
I did make that optimisation.
> If you want to serve them in their preferred language, then you need to
> look into variants as well as minority languages whose speakers are all
> bilingual, and the picture gets both more difficult but also more
> interesting.
I think that if you talk about preferred language, the picture gets less
difficult. "What language do you speak at home?" is a question that
anyone can answer, and many countries have stats for. "What are all the
languages you speak?" is not a question I've found data for, for any
country.
> So, the big problem is that you can't say "Microsoft doesn't reach
> Sorbs" just because they don't offer Sorbian, as they reach them pretty
> well with German (probably as well as one reaches Brits with en-US,
> actually).
Quite so. Which is why I don't say that Microsoft doesn't reach Sorbs. :-)
> What you can say is that we serve Sorbs better by offering
> that language than by German, but we actually reach them with both.
That is true. We are again back to my point that I am putting 90% and
100% solutions in the same basket and contrasting them with the 0% solution.
> So, you mean French _and_ German _and_ their Austrian dialect for a
> friend of mine, as she speaks all three of them at home? ;-)
:-P
Gerv
Right. But it doesn't tell us what you wanted to know, which was
"statistics that show the percentage of the population of Thailand that
speaks *only* Thai".
Gerv
Yes, and I know a German who speaks Upper Sorbian, Lower Sorbian,
Esperanto and Lithuanian. He is a professor of Sorabistics in Leipzig.
AFAIK he speaks Lithuanian with his wife at home, though she is an
Esperantist as well. :-)
Michael
Huh? Anyway, looking at the snapshot, MS neither has bal nor be, which
you attribute 0.511 and 0.452 % to, resp., which is a 13% difference.
10-ish % seems to be *very* low as an error bar, so I don't see how your
data should make a call on whether it'd be bal or be.
Thus, "Which locale should Microsoft do next" is an ill-posed question
for the data you have.
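A quick Python sketch of that argument: it computes the relative difference between the two shares and checks whether their error intervals overlap if each figure carries a 10% relative error. The 0.511 and 0.452 values are the ones quoted above; the 10% error is the "10-ish %" that the message itself calls very low, and the helper names are hypothetical.

```python
def interval(value, relative_error):
    """Return (low, high) for value with a symmetric relative error."""
    half_width = value * relative_error
    return value - half_width, value + half_width


def intervals_overlap(a, b):
    """True if the closed intervals a and b share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


bal, be = 0.511, 0.452  # shares in percent, as quoted in the message
difference = (bal / be - 1) * 100
overlap = intervals_overlap(interval(bal, 0.10), interval(be, 0.10))
print(f"relative difference: {difference:.0f}%")
print(f"error intervals overlap: {overlap}")
```

With overlapping intervals the ranking of the two locales cannot be decided from the data, and any larger error bar only widens the overlap.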
Axel
Try the new one. There was a bug: bal is now 0.047%, and be is 0.450% -
a factor of 10 difference.
Next after Belarusian is Oriya: 5,478,000 vs. 2,085,318, so a gap of more than half.
With the data I have, Belarusian is the clear answer. Of course, if you
want to improve the data, that could possibly change. :-)
Gerv
Is the issue that we do not know how many people in Thailand speak
English in addition to Thai? Or is the issue how many people
in Thailand speak Thai but no other minority language?
In both questions above, my view is that if we are to go into those
directions, we miss the point. The point is that in the example of
Thailand, over 93% of the people speak Thai (and the others probably speak
Thai + some other minority language). Considering that Thailand has a
population of 65 million people, we have a chunk of over 60 million
people with no Thai Firefox 3.
Simos
You are moving backwards and forwards between two positions :-)
To recap: there are two ways that it's possible to count. From the point
of view of the user:
1) "Firefox is supported in at least one of the languages I speak."
This is the data you originally asked about. This would produce higher
percentages because e.g. at the moment, we have to say that we don't
serve any Thai people because we don't have an official localization in
Thai, but if we used this method, we could say we served all the Thais
who also speak English. But my point is that there is no data available
that gives us a list of all the languages that each person in Thailand
speaks. So we can't calculate this figure.
2) "Firefox is supported in my native (spoken-at-home) language."
It seems to me that this is what you have switched to talking about in
your message above. This is what my spreadsheet shows us. And it does
indeed tell us that there are 8.57 million people (net population, not
total population) in Thailand with no Firefox 3 in their native Thai
language.
Gerv
You just said that you are measuring the same thing differently for
those, so your measurement is inconsistent. :p
> I think that if you talk about preferred language, the picture gets less
> difficult. "What language do you speak at home?" is a question that
> anyone can answer, and many countries have stats for. "What are all the
> languages you speak?" is not a question I've found data for, for any
> country.
What about ethnologue.com?
At least http://www.ethnologue.com/show_country.asp?name=AT clearly
shows for me that most of the 8,174,762 people in Austria natively speak
both Bavarian (6,983,298) and Standard German (7,500,000) - of course it
doesn't tell what part natively speaks only one of those languages that
are listed on the page, or which two (or three) of them.
(Note that in the case of those two languages, both of which are variants
of German, we cover all those people with our Standard German L10n
currently, as all speakers of Bavarian, Alemannisch and Walser at least
learn Standard German as a child.)
And the statistics you mean are about "What language do you prefer to
speak at home?" as the question "What language do you speak at home?"
would come out with multiple languages for many people and lead to
the more difficult picture.
Robert Kaiser
I don't think so. Saying that an en-US speaker and an en-GB speaker
speak the same language is not the same thing as saying that someone
whose first language is Hindi and whose second language is English can
be served with an English build (for example).
>> I think that if you talk about preferred language, the picture gets less
>> difficult. "What language do you speak at home?" is a question that
>> anyone can answer, and many countries have stats for. "What are all the
>> languages you speak?" is not a question I've found data for, for any
>> country.
>
> What about ethnologue.com?
>
> At least http://www.ethnologue.com/show_country.asp?name=AT clearly
> shows for me that most of the 8,174,762 people in Austria natively speak
> both Bavarian (6,983,298) and Standard German (7,500,000) - of course it
> doesn't tell what part natively speaks only one of those languages that
> are listed on the page, or which two (or three) of them.
Exactly. And this is precisely the information we would need.
> And the statistics you mean are about "What language do you prefer to
> speak at home?" as the question "What language do you speak at home?"
> would come out with multiple languages for many people and lead to
> the more difficult picture.
I don't agree. I think "language spoken at home" is a single language
for 99.9% of people.
Gerv
What might be much harder, but which we've used a bit in South Africa, is
to look at language groups. Xhosa, Zulu, Ndebele and Swati are all
Nguni languages. There are 20 million Zulu speakers, but a Zulu Firefox
could be used and understood by the other roughly 15 million who speak
another Nguni language.
But getting data on this is hard.
> 2) Percentage of the world population covered in native language
> (this doesn't take into account internet penetration, but you could
> say "hey, they have a browser waiting for them when they get online")
Considering the cellphone revolution, certainly in Africa, this figure
does make sense.
> 3) Percentage of the world/internet population covered by any language
> they speak
> (Not sure where you'd find the right figures to do this)
>
> 4) Number of languages covered
> (this is what counting packs does, although it runs into trouble with
> the definition of "language", and the fact that you give a language
> with 10,000 speakers the same weight as one with 100,000,000)
I think this count still gives an indication of how our process can
scale, the power of openness, etc. But I think it really only makes
sense when compared to other browsers and we can show change over time.
> 5) Percentage of countries covered, by official language
> (This might be a good proxy for method 3, because you would hope that
> everyone in a country speaks at least one of the official languages)
It is also a great influencer for adoption within Government and
education so a good measure. Also relatively easy to administer as
official languages are pretty static. This can be deceptive though in
certain countries where official languages are not reflective of the
spoken languages in the country and may cover a small minority of the
country.
--
Dwayne Bailey
Translate.org.za
+27-12-460-1095 (w)
+27-83-443-7114 (cell)
I think it's more between 80 and 90% than over 99%.
Robert Kaiser
10-20% of people in the world have parents who a) speak different
languages, and b) haven't agreed on a single language to use at home? I
really don't think that's true.
Gerv
But I do. And it doesn't mean parents need to speak different languages.
Didn't you learn from the example of Sorbs?
Robert Kaiser
I think we are talking about different things.
When I talk about "native language", I am not talking about bilingualism
(where people have the ability to speak more than one language). That
is, I agree, fairly common.
If the Sorbs speak Sorbian with their parents, then that's their
"language spoken at home". If they speak German with their parents, then
that's their "language spoken at home".
What I am talking about is the language that people speak with their
family in the house where they live ("language spoken at home"). My
assertion is that in the vast majority of cases (99%+), people only
speak one language with their parents. This may or may not be the same
language as the one they speak with the rest of the country.
For example, a Polish family moves to England. They still speak Polish
at home, even though they speak English with almost everyone else.
Another example: A German man falls in love with a French woman. As the
French woman's German is better than the German man's French, they speak
to each other in German. She moves to Germany and they get married, and
have children. They are most likely to speak predominantly one of German
or French (in this case, probably German) at home. They aren't going to
speak German on Mondays, French on Tuesdays, and so on.
Actually, using a European example is a little bit misleading, because
this sort of marriage is much more common in Europe than anywhere else
(lots of close-together countries with different languages and high
mobility.)
Gerv
Sure, but they're probably using all their software in German (maybe
with the amazing exception of Firefox), and they don't have any problem
whatsoever with that. It's exactly the same as a Brit using an en-US piece
of software - but you treat those cases differently.
Robert Kaiser
Yes, I offer:
Upper Sorbian:
- Firefox
- Seamonkey
- Thunderbird
- Sunbird/Lightning
- KompoZer/Nvu
- some add-ons
Lower Sorbian:
- Firefox
Besides there is a project to localize KDE, but I'm not participating in
this.
The situation regarding the language spoken at home is different.
There are:
- purely Sorbian families where Sorbian is spoken
- mixed families where Sorbian or German or both are spoken
- German families(?) and/or German individuals who feel they are Sorbs and
speak Sorbian and/or German
The issue is that some Sorbs don't understand Sorbian, especially those who
live in the neighbourhood of a German majority.
Most Sorbian people live in the Catholic region east of the town of Kamenz,
so Sorbian is the most widely spoken language there.
Michael
I don't think this is quite the same. All US English speakers can
understand UK English, and vice versa, because they are the same
language. It's not necessarily true that all Sorbian speakers can
understand German (although it may be 99% true in practice), because
they are different languages.
Let's say I was to split up en-US, en-GB and en-ZA. Into which of the
three columns do I put English speakers from India? Hong Kong?
Australia? Canada?
Microsoft only ship one English build. Do I mark them as supporting only
en-US, or all three?
Gerv
This might be true for Sorbs of pre-school age (although there are also day
nurseries where they learn German), but they learn German at school at the
latest.
>
> Let's say I was to split up en-US, en-GB and en-ZA. Into which of the
> three columns do I put English speakers from India? Hong Kong?
> Australia? Canada?
>
> Microsoft only ship one English build. Do I mark them as supporting only
> en-US, or all three?
I think you should mark this as one language variety and positively
emphasize that Firefox supports more of them. You should emphasize that
Firefox/Mozilla meets the demands of its users better. You should give a
reader of the statistics the positive impression about the diversity of
the localization support that Firefox/Mozilla is offering.
Michael
Haven't been following these language stats threads thoroughly over
the last few days, but wouldn't having at least the few "raw" data
columns be rather helpful?
That would be language, locale, and language+locale. From that, people
could at least make their own educated guesses. If I look around me
counting only the immediate vicinity where I am in Sweden, I have Farsi,
Arabic, Spanish, Finnish, Danish, Norwegian, German, French, Armenian,
Azeri, Russian, Polish, English, Icelandic, Estonian, Hungarian,
Mandarin, etc. etc. besides sv, which is for all practical purposes the
same in SE and FI.
Being multilingual is the norm nowadays, and living outside a country
where your language is "at home" is increasingly common.
Trying to cover all would at best give rough estimates. Apart from the
linguistics departments of some universities, some bible translation
site(s) may perhaps give fairly accurate and *practically usable*
figures for how many speakers any given tongue has. (Should we count the
natively German speakers in e. g. Gramado and other places in Rio Grande
do Sul in Brazil into the figures for German...? They're into the fifth
generation or so speaking German without "touching home base" in
Germany, Austria etc. And which others?).
BR,
Gudmund
Given the lack of any real grammar, we should just dismiss English
altogether ;-)
Axel