Problem with auto-detect for traditional Chinese


Venkatesh Venky

Nov 10, 2015, 1:49:24 AM
to Google Translate API Developer Forum

Hi, 

I created a console application in C# and I'm using the Google Translate REST service.

I used this URL to send a request to auto-detect the language of the text:


When I tried sending the text "本月縱觀國內新車市場,經過了很長一段時間,在新車上市" (Traditional Chinese) to the REST service,
I am getting the response as Traditional Chinese.


Is there any bug with the API or am I doing something wrong?


Thanks,
Venkatesh.B

Nick

Nov 10, 2015, 2:47:05 PM
to Google Translate API Developer Forum
You can specify the source and destination languages if you don't want to use auto-detect.
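For example, a request with an explicit source might look roughly like this (a sketch in C#, since you mentioned a C# console app; "YOUR_API_KEY" is a placeholder):

    // Sketch: v2 translate call with explicit source/target, so no
    // language detection is involved. "YOUR_API_KEY" is a placeholder.
    using System;
    using System.Net.Http;
    using System.Threading.Tasks;

    class TranslateSketch
    {
        static async Task Main()
        {
            using var client = new HttpClient();
            var url = "https://www.googleapis.com/language/translate/v2"
                    + "?key=YOUR_API_KEY"
                    + "&source=zh-TW"   // explicit source language
                    + "&target=en"
                    + "&q=" + Uri.EscapeDataString("本月縱觀國內新車市場");
            var response = await client.GetAsync(url);
            Console.WriteLine(await response.Content.ReadAsStringAsync());
        }
    }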

Apart from that advice, however, I'm unable to understand your post. Do you mean that you sent the text to the https://www.googleapis.com/language/translate/v2/languages URL? That certainly won't work. Perhaps you could explain in more detail what's happening: what is the exact request you're sending, and what is the response?

Venkatesh Venky

Nov 11, 2015, 2:06:19 AM
to Google Translate API Developer Forum
Hi,

I am sending Traditional Chinese to the REST service for language auto-detection, and the URL is https://www.googleapis.com/language/translate/v2/detect,
but I am getting the response "zh-CN".

I even tried sending Simplified Chinese to the REST service for language detection, and I am getting the response "zh-CN" there as well.

Is there any difference between the Traditional Chinese and Simplified Chinese languages?

Why is the REST API giving the same response, "zh-CN", for both Simplified Chinese and Traditional Chinese?
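
For reference, the request I am sending looks roughly like this (a sketch of my C# code; the API key is a placeholder):

    // Sketch: the detect call described above. "YOUR_API_KEY" is a placeholder.
    using System;
    using System.Net.Http;
    using System.Threading.Tasks;

    class DetectSketch
    {
        static async Task Main()
        {
            using var client = new HttpClient();
            var url = "https://www.googleapis.com/language/translate/v2/detect"
                    + "?key=YOUR_API_KEY"
                    + "&q=" + Uri.EscapeDataString(
                          "本月縱觀國內新車市場,經過了很長一段時間,在新車上市");
            var response = await client.GetAsync(url);
            // The response reports "language": "zh-CN" even for this
            // Traditional Chinese input.
            Console.WriteLine(await response.Content.ReadAsStringAsync());
        }
    }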

Nick

Nov 11, 2015, 5:00:51 PM
to Google Translate API Developer Forum
Alright: after a bit of reading, I'm now reasonably sure that I understand what's happening in this situation. My hope is that this post will serve as a signpost for future discussions of Chinese translation using the API, since it contains a lot of information.

A few things to clarify first:
  • Use of source and target params: In translation, the source and target parameters can be used to specify explicitly the exact languages desired as input and output of the translation algorithm, so a false positive from the system's language detection is moot when you already know the source language and can specify it yourself.
  • Identical translation: When translating the phrase you posted with zh-CN and zh-TW, the results are identical:
    • &target=en&source=zh-CN produces:
    • {
          "data": {
              "translations": [
                  {
                      "translatedText": "Looking at the domestic new car market this month, after a long period of time, in the new car market"
                  }
              ]
          }
      }
    • &target=en&source=zh-TW produces:
    • {
          "data": {
              "translations": [
                  {
                      "translatedText": "Looking at the domestic new car market this month, after a long period of time, in the new car market"
                  }
              ]
          }
      }
  • On the use of the specifically chosen language codes: In the previous link, "zh-TW" maps to "Traditional Chinese" while "zh-CN" maps to "Simplified Chinese". See this RFC describing how IETF language tags should be composed, which defines how the IANA Language Subtag Registry should function.
    • If you check in the subtag registry, you'll find that it's possible to specify languages with different levels of granularity using a system of tags which, in brief, can be formatted as {language}-{script}-{region}:
A tag such as "zh-Hans-XQ" conveys a great deal of public, interchangeable information about the language material (that it is Chinese in the simplified Chinese script and is suitable for some geographic region 'XQ'). 
... 
Appendix A. Examples of Language Tags (Informative)
... 
     Language subtag plus Script subtag:
     ...
        zh-Hant (Chinese written using the Traditional Chinese script)
        zh-Hans (Chinese written using the Simplified Chinese script)
(from the RFC)
    •  In the IANA Language Subtag Registry which defines the meaning of "zh", we see the following entry:
Type: language
Subtag: zh
Description: Chinese
Added: 2005-10-16
Scope: macrolanguage
    • The ISO 3166-1 standard defines the two-character country codes such as "CN" or "TW", with "CN" mapping to "People's Republic of China" and "TW" mapping to "Taiwan, Province of China".
    • So it appears that we use "Chinese - Taiwan" (zh-TW) to mean "Traditional Chinese", possibly because the traditional script is predominantly used there, while "Chinese - PRC" (zh-CN) maps to "Simplified Chinese", since the simplified script is predominantly used in the PRC. But speculation on my part wouldn't be very useful here, and at any rate the docs specify what these codes map to, with no deeper discussion of the exact meaning of the terms.
    • Whatever the intentions behind this choice, there is some arbitrariness in which IETF language tag to use for "Simplified Chinese" or "Traditional Chinese", and many technical organizations other than Google make their own choices, because characters, scripts, languages, and countries cannot be mapped one-to-one onto one another. That ambiguity is discussed in depth below; a short sketch of how one other system reads these tags follows this list.
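
As a quick illustration (a C# sketch, offered purely as an example of another system's choices, not Google's), here is how .NET's CultureInfo interprets these same tags; exact display names vary by .NET version and OS:

    using System;
    using System.Globalization;

    class TagSketch
    {
        static void Main()
        {
            foreach (var tag in new[] { "zh-CN", "zh-TW", "zh-Hans", "zh-Hant" })
            {
                Console.WriteLine($"{tag} -> {new CultureInfo(tag).EnglishName}");
            }
            // Indicative output (varies by runtime):
            // zh-CN   -> Chinese (Simplified, China)
            // zh-TW   -> Chinese (Traditional, Taiwan)
            // zh-Hans -> Chinese (Simplified)
            // zh-Hant -> Chinese (Traditional)
        }
    }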

-------------------------------------------------------------------------------------------

With these points clarified, we can go into the deeper discussion:

As for the language detection returning "zh-CN" (Simplified Chinese) despite the source being, to your eyes, Traditional Chinese (so you would expect the API's detect method to return "zh-TW"): this can be explained by the technical problems inherent in detecting a language from a finite input of text (at least, or especially, for sufficiently small inputs), along with some doc links:



Difficulty inherent in detecting source language based on finite UTF-8 input 

When determining whether some input Unicode text represents Simplified or Traditional Chinese, or even Taiwanese, or any one of the numerous written scripts and language families that share characters, there isn't necessarily a clean way to determine the language that the text as a whole amounts to, unless given a sufficiently large and homogeneous input.

In a nutshell, this is because Han characters are not in a one-to-one relation with the languages that use them. This should be understood not just in the context of "Chinese vs. Korean" or "Chinese vs. Japanese": in this context, Simplified Chinese and Traditional Chinese act as different languages that both use the Han script. For example, one grapheme that is valid in Traditional Chinese is used in Simplified Chinese as the replacement, in translation, for several other traditional graphemes, so it has membership and meaning in both "languages".

As part of the process of allocating space in Unicode's code space for languages, a set of characters and character components common to the Chinese, Japanese, and Korean languages was defined: the "CJK Unified Ideographs". This is due to a common historical origin, as the Unicode 8.0.0 spec, chapter 18 (East Asia), briefly notes:

The characters that are now called East Asian ideographs, and known as Han ideographs in the Unicode Standard, were developed in China in the second millennium BCE.
... 
As civilizations developed surrounding China, they frequently adapted China’s ideographs for writing their own languages. Japan, Korea, and Vietnam all borrowed and modified Chinese ideographs for their own languages.

So the characters and character components that need space allocated for them in the Unicode code space (as of Unicode 8.0.0, there are 80,388 total CJK code points) for Han-derived or Han-influenced languages are not necessarily unique to one language's script. The process of putting all of these characters and components into a contiguous block in the code space was known as "Han unification", as described by Wikipedia:

Han unification is an effort by the authors of Unicode and the Universal Character Set to map multiple character sets of the so-called CJK languages into a single set of unified characters. Han characters are a common feature of written Chinese (hanzi), Japanese (kanji), and Korean (hanja).
 
Modern Chinese, Japanese and Korean typefaces typically use regional or historical variants of a given Han character. In the formulation of Unicode, an attempt was made to unify these variants by considering them different glyphs representing the same "grapheme", or orthographic unit, hence, "Han unification", with the resulting character repertoire sometimes contracted to Unihan.
(From Wikipedia - Han Unification)

 
Unicode deals with the issue of simplified and traditional characters as part of the project of Han unification by including code points for each. This was rendered necessary by the fact that the linkage between simplified characters and traditional characters is not one-to-one. 

So we can see that the CJK code block of characters, which are then encoded in UTF-8, does not offer an inherent way of differentiating which language a given character "belongs to". 
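To make this concrete, here is a small sketch (C#, .NET Core 3.0+ for EnumerateRunes) showing that your traditional sample and its simplified rendering both fall entirely within the CJK Unified Ideographs block, so block membership alone cannot identify the language:

    using System;
    using System.Text;

    class CjkSketch
    {
        static void Main()
        {
            string traditional = "本月縱觀國內新車市場";
            string simplified  = "本月纵观国内新车市场";  // same phrase, simplified forms
            foreach (var text in new[] { traditional, simplified })
            {
                foreach (Rune rune in text.EnumerateRunes())
                {
                    // U+4E00..U+9FFF is the CJK Unified Ideographs block.
                    bool cjk = rune.Value >= 0x4E00 && rune.Value <= 0x9FFF;
                    Console.WriteLine($"U+{rune.Value:X4} {rune} CJK Unified: {cjk}");
                }
            }
        }
    }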
The unicode.org FAQ on CJK explains the difficulty of determining "language" from finite input this way, in reference to Korean vs. Chinese:

How can I recognize from the 32 bit value of a Unicode character if this is a Chinese, Korean or Japanese character?
 
It's basically impossible and largely meaningless. It's the equivalent of asking if "a" is an English letter or a French one. There are some characters where one can guess based on the source information in the Unihan Database that it's traditional Chinese, simplified Chinese, Japanese, Korean, or Vietnamese, but there are too many exceptions to make this really reliable.
... 
The only proper mechanism, as for determining whether "chat" is spelled correctly in English or French, is to use a higher-level protocol.
(emphasis mine)

While there are Han characters and components in Unicode which aren't part of the CJK Unified Ideographs, that's a more complex question, and you can read about it in this informative blog post.

The important final point, which brings all of this into focus on your specific issue, is that Simplified Chinese is similar to the other East Asian scripts in that it uses a subset of the CJK space that is neither clearly delineated nor exclusive, which makes detecting the language a complex task.



This is where Google Translate comes in

So, what is the "higher-level protocol" capable of doing the work of discriminating between different languages that use common characters and components, given a finite sample? That's exactly what the Translate API infrastructure does, behind the API abstraction.

While we don't publish the details of our algorithmic magic, it's pretty good at what it does. It's not perfect, however. With that in mind, we can see why the API's language-detection method returns not just its best guess, but also "confidence", a confidence level between 0 and 1, and "isReliable", a boolean value which is deprecated and not as useful as "confidence". Using the "confidence" score to examine the result of detection in your case, the system is very confident that the text represents Simplified Chinese:

{ "data": { "detections": [ [ { "language": "zh-CN", "isReliable": false, "confidence": 0.99466604 } ] ] } }

Elsewhere on the internet, discussion abounds on methods of discerning between simplified and traditional Chinese characters and text, despite the fact that the question isn't always satisfied by a deterministic yes/no answer (a toy sketch of the idea follows this list):
  • Here is a Stack Overflow Q&A where this is discussed, with some useful resources linked.
  • Here's another Q&A where the question comes up. Note that the OP is searching for something which doesn't exist in the Unicode code space: a clearly delineated block of "only simplified characters", with no traditional characters mixed in. The answers explain why this isn't as simple as it might seem to the OP. Somebody links to a Ruby library which purports to do this detection, at a level of quality which you will have to test in practice.
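
To illustrate the flavour of those heuristics, here is a toy sketch (C#); the variant pairs are a tiny hand-picked list, and a real detector would need a full variant table plus statistical modelling:

    using System;
    using System.Linq;

    class VariantSketch
    {
        // Illustrative pairs only: 国/國, 车/車, 场/場, 观/觀, 纵/縱.
        const string SimplifiedOnly  = "国车场观纵";
        const string TraditionalOnly = "國車場觀縱";

        static void Main()
        {
            string sample = "本月縱觀國內新車市場,經過了很長一段時間,在新車上市";
            int simp = sample.Count(c => SimplifiedOnly.Contains(c));
            int trad = sample.Count(c => TraditionalOnly.Contains(c));
            // Many characters are shared between the two scripts, so the
            // counts only "lean" one way rather than prove anything.
            Console.WriteLine(simp > trad ? "leans simplified"
                            : trad > simp ? "leans traditional"
                            : "inconclusive");
        }
    }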


In conclusion

It doesn't seem as though there's any issue here beyond the response reporting far too high a confidence that the input is Simplified Chinese. From inspecting the characters in your sample, I can see that several of them are traditional forms, so it's odd that the system would not use the presence of traditional characters to point strongly towards Traditional Chinese as the detected language. I'll return to this thread shortly with an explanation or a notification of a fix, if I can manage to produce either. For now, I wouldn't worry about the response, since you seem quite certain yourself that the source is Traditional, and any calls to translation can have their source language specified without machine detection being needed.

Nick

Dec 16, 2015, 6:36:30 PM
to Google Translate API Developer Forum
Thank you again for your feedback on this issue. I'm happy to report that a fix is in place, and your example from the first post, "本月縱觀國內新車市場,經過了很長一段時間,在新車上市", is now properly detected as Traditional Chinese:


{
  "data": {
    "detections": [
      [
        {
          "language": "zh-TW",
          "isReliable": false,
          "confidence": 0.9731878
        }
      ]
    ]
  }
}


So, have a nice day, and feel free to reply if you find anything else out of place or in need of fixing.

Sam Elalouf

Sep 7, 2020, 11:49:55 AM
to Google Cloud Translation API
I don't know if anyone is still following this thread, but it seems that the same problem has resurfaced, and the API's auto-detect is incorrectly identifying Traditional Chinese text as Simplified Chinese. This includes the example ("本月縱觀國內新車市場,經過了很長一段時間,在新車上市") given above. I have even tried submitting strings consisting entirely of traditional characters ("個") and had the same result: the output was still "Detected(lang=zh-CN, confidence=1.0)".

Thank you in advance for any help you can offer!

Konstantin Savenkov

Sep 9, 2020, 7:56:37 PM
to Google Cloud Translation API
Hi Sam,

We use Google's language detection quite a bit, and I've just checked: it returns zh-TW for your string. If it matters, we use v3 (https://cloud.google.com/translate/docs/advanced/detecting-language-v3).
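For reference, our call looks roughly like this (a C# sketch; PROJECT_ID and the access token are placeholders, and v3 uses OAuth bearer tokens rather than a simple API key):

    using System;
    using System.Net.Http;
    using System.Net.Http.Headers;
    using System.Text;
    using System.Threading.Tasks;

    class DetectV3Sketch
    {
        static async Task Main()
        {
            using var client = new HttpClient();
            client.DefaultRequestHeaders.Authorization =
                new AuthenticationHeaderValue("Bearer", "ACCESS_TOKEN");

            var url = "https://translation.googleapis.com/v3/projects/"
                    + "PROJECT_ID/locations/global:detectLanguage";
            var body = new StringContent(
                "{\"content\": \"本月縱觀國內新車市場,經過了很長一段時間,在新車上市\"}",
                Encoding.UTF8, "application/json");

            var response = await client.PostAsync(url, body);
            // For this string we get back languageCode "zh-TW".
            Console.WriteLine(await response.Content.ReadAsStringAsync());
        }
    }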

cheers,
Konstantin.

Junpei Zhou

unread,
Sep 10, 2020, 9:21:13 AM9/10/20
to Google Cloud Translation API
I tried both the v2 and v3 APIs, and both return the correct result, "zh-TW", for the sentence you provided ("本月縱觀國內新車市場,經過了很長一段時間,在新車上市").

For the v2 API:
- HTTP method and URL:
- Request JSON body:
{ "q": "本月縱觀國內新車市場,經過了很長一段時間,在新車上市" }
- Return info:
{
  "data": {
    "detections": [
      [
        {
          "isReliable": false,
          "language": "zh-TW",
          "confidence": 1
        }
      ]
    ]
  }
}


For the v3 API:
- HTTP method and URL:
- Request JSON body:
{ "content": "本月縱觀國內新車市場,經過了很長一段時間,在新車上市" }
- Return info:
{
  "languages": [
    {
      "languageCode": "zh-TW",
      "confidence": 1
    }
  ]
}


Could you please check the command you used?
