If you work on language processing or i18n (internationalization) involving Chinese, it’s helpful to have a basic understanding of what “Chinese” is and how it works. Before beginning, I’d like to set up a few assumptions:
- Chinese isn’t a language. It’s a macrolanguage, which for practical purposes means it’s a language family [source]. The Chinese languages aren’t mutually intelligible.
- Chinese isn’t a written language. The Chinese languages aren’t mutually intelligible in written form either.
- Chinese characters are not a language. Chinese characters form a writing system which is independent of the language being written. Just as the Latin alphabet is used to write many European languages, Chinese characters are used to write many Chinese languages. But Chinese languages don’t have to be written in Chinese characters, and some non-Chinese languages use Chinese characters for writing.
With these assumptions set up, we’ll start with the spoken languages, which are more straightforward, before going over the written languages.
1 Spoken Languages
1.1 Language List
The Chinese macrolanguage, or the Sinitic languages, is a language family with internal diversity comparable to the Romance language family’s [source]. According to the International Organization for Standardization (ISO 639-3), it includes 16 languages, each with unique grammar, vocabulary, and pronunciation [source]. This is the list of languages from ISO, along with their Mandarin names and some notable varieties:
- Gan (贛語 gànyǔ)
- Hakka (客家語 kèjiāyǔ)
- Hui (徽語 huīyǔ)
- Jin (晉語 jìnyǔ)
- Literary Chinese (文言文 wényánwén)
- Mandarin (官話 guānhuà)
  - Standard form: Standard Chinese (華語 huáyǔ, 普通話 pǔtōnghuà, or 國語 guóyǔ)
  - Notable variety: Sichuanese (四川話 sìchuānhuà)
- Northern Min (閩北語 mǐnběiyǔ)
- Northern Ping (桂北平話 guìběi pínghuà)
- Eastern Min (閩東語 mǐndōngyǔ)
- Southern Min (閩南語 mǐnnányǔ)
  - Standard form: Taiwanese (台語 táiyǔ)
  - Notable variety: Teochew (潮州話 cháozhōuhuà)
- Southern Ping (桂南平話 guìnán pínghuà)
- Central Min (閩中語 mǐnzhōngyǔ)
- Puxian (莆仙語 púxiānyǔ)
- Wu (吳語 wúyǔ)
  - Notable variety: Shanghainese (上海話 shànghǎihuà)
- Xiang (湘語 xiāngyǔ)
- Yue (粵語 yuèyǔ)
  - Standard form: Cantonese (廣東話 guǎngdōnghuà)
  - Notable variety: Taishanese (台山話 táishānhuà)
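For i18n code, the list above maps onto ISO 639-3 three-letter codes. The following is a minimal sketch that pairs each of the 16 languages with its code and with the macrolanguage code "zho". The variable and function names are my own, the English names follow this article rather than the registry's reference names, and the registry changes over time, so check the codes against the current ISO 639-3 tables before relying on them.

```python
# ISO 639-3 codes for the 16 languages listed above.
# Names follow this article, not the registry's reference names
# (e.g. the registry calls "czh" Huizhou Chinese and "mnp" Min Bei Chinese).
SINITIC_ISO_639_3 = {
    "gan": "Gan",
    "hak": "Hakka",
    "czh": "Hui",
    "cjy": "Jin",
    "lzh": "Literary Chinese",
    "cmn": "Mandarin",
    "mnp": "Northern Min",
    "cnp": "Northern Ping",
    "cdo": "Eastern Min",
    "nan": "Southern Min",
    "csp": "Southern Ping",
    "czo": "Central Min",
    "cpx": "Puxian",
    "wuu": "Wu",
    "hsn": "Xiang",
    "yue": "Yue",
}

# Code for the Chinese macrolanguage as a whole.
MACROLANGUAGE = "zho"


def is_sinitic(code: str) -> bool:
    """Return True for the macrolanguage code or any member listed above."""
    code = code.lower()
    return code == MACROLANGUAGE or code in SINITIC_ISO_639_3
```

Storing the specific code (for example "yue" rather than "zho") keeps the information needed later to pick the right pronunciation data or translation.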
1.2 Standard Chinese
Due to Mandarin’s official status, its standard variety is simply known as “Standard Chinese” [source]. An unqualified reference to “Chinese” as a single language typically refers to Standard Chinese. It is a semi-artificial Mandarin variety defined as having (1) the pronunciation system of Beijing Mandarin, (2) a vocabulary based on all Mandarin dialects, and (3) a grammar based on “written vernacular Chinese” (白話文 báihuàwén), a written Mandarin variety developed in late imperial China [source]. Because of this definition, it isn’t strictly correct to equate Beijing Mandarin with Standard Chinese: they differ in vocabulary and grammar.
Each of the languages has its own set of dialects. Sichuanese, Teochew, and Taishanese, listed above, are varieties of Mandarin, Southern Min, and Yue, respectively, and are notable for being nearly unintelligible to speakers of the standard form of their language. Although the ISO list doesn’t consider them separate languages, a strong argument could be made that they are. The distinction between language and dialect isn’t always clear [source].
Note that the tendency in English to call the Sinitic languages “dialects” may be the result of politics or perhaps a questionable translation of the term 方言 fāngyán (“regional speech”). In practice, they should not be treated as such.
1.4 Literary Pronunciations
Sinitic languages which are only distantly related to Standard Chinese tend to exhibit a phenomenon known as literary pronunciations, where each character has two pronunciations, a native one (白讀 báidú) which may or may not be Sinitic in origin and a literary one (文讀 wéndú) borrowed from Standard Chinese or an ancestor of Standard Chinese [source]. Usually, the native pronunciation is used in native compounds while the literary pronunciation is used in compounds borrowed from Standard Chinese or Literary Chinese. For example, for the character 人 in Taiwanese:
| Native: 人 lâng | Native: 囡仔人 gín-á-lâng “child” |
| Literary: 人 jîn | Borrowed: 人民 jîn-bîn “citizens” |
But this isn’t always the case, as evidenced by the existence of mixed-pronunciation words such as 各人 kok-lâng “each person”, where 各 kok is literary but 人 lâng is native [source]. Outside of compounds, usage can even seem arbitrary. The phenomenon occurs to a limited extent in Beijing Mandarin (e.g., 薄 bó vs. 薄 báo) because Mandarin was previously standardized based on Nanjing pronunciation, but typically one pronunciation is dominant or the pronunciations are interchangeable.
Literary and native pronunciations in Sinitic languages are analogous to the native (訓読み kunyomi) and borrowed (音読み onyomi) readings in Japanese, and they complicate text-to-speech and phonetic transcription. As of this writing, no algorithm perfectly determines the in-context pronunciation of a Chinese character in Japanese or in a Sinitic language with a substantial literary pronunciation system [source].
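One common workaround, sketched here under my own assumptions, is to look up whole words in a lexicon first and fall back to a per-character default only when no word matches. The data below contains only the Taiwanese examples from this section, and the names WORD_READINGS, CHAR_READINGS, and transcribe are hypothetical. This is not a complete solution: it is only as good as its lexicon, and the fallback reading is a guess.

```python
# Word-level lexicon: compounds carry their full reading, because
# per-character rules cannot predict cases like 各人 kok-lâng.
WORD_READINGS = {
    "囡仔人": "gín-á-lâng",  # native compound
    "人民": "jîn-bîn",       # borrowed compound, literary reading
    "各人": "kok-lâng",      # mixed: literary 各 + native 人
}

# Per-character fallback with both readings kept side by side.
CHAR_READINGS = {
    "人": {"native": "lâng", "literary": "jîn"},
}


def transcribe(text: str, fallback: str = "native") -> list[str]:
    """Greedy longest-match over the word lexicon, then per-character fallback.

    Characters with no known reading are passed through unchanged.
    """
    out = []
    i = 0
    max_len = max(len(word) for word in WORD_READINGS)
    while i < len(text):
        # Try the longest candidate first so compounds win over single characters.
        for n in range(min(max_len, len(text) - i), 0, -1):
            chunk = text[i:i + n]
            if chunk in WORD_READINGS:
                out.append(WORD_READINGS[chunk])
                i += n
                break
        else:
            ch = text[i]
            readings = CHAR_READINGS.get(ch)
            out.append(readings[fallback] if readings else ch)
            i += 1
    return out
```

Here transcribe("人民") uses the literary reading because the whole compound is in the lexicon, while a bare 人 falls back to whichever reading the caller chose as the default.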
2 Written Languages
An unqualified reference to “Chinese” as a written language (中文 zhōngwén) usually refers to written Mandarin. In antiquity, it would have referred to Literary Chinese (文言文 wényánwén), but that language sees limited usage now. Sinitic languages don’t share the same written language, but they usually use Chinese characters as their writing system. Compare English and Spanish, which share the same writing system—the Latin alphabet—but not the same written language.
2.1 Mutual Intelligibility
Prose in different Sinitic languages is mutually unintelligible. But because Chinese characters are logographic, a speaker of one Sinitic language is likely to understand somewhat more of a different Sinitic language in written form than in spoken form. Full comprehension is of course not possible without prior knowledge of the other language [source]. The effect is strongest with nouns, many of which are written the same way across Sinitic languages. Similarly, an English speaker may be able to extract some meaning from a Spanish text based on cognates but won’t fully comprehend it.
2.2 Traditional and Simplified Characters
Traditional and Simplified Chinese characters are variations of the writing system, not of a language. Any Sinitic language can be written in either system, but in practice the writing system used is related to where the language is spoken or standardized. Cantonese, Taiwanese, and Hakka are standardized in areas where Traditional Chinese characters are used, so they are usually written in Traditional Chinese characters. Mandarin and Literary Chinese are written in either, and others are written in Simplified Chinese characters.
In language processing and i18n, sometimes “Chinese (Simplified)” is treated as one language and “Chinese (Traditional)” is treated as another. These are of course not languages. Either only the writing system is being changed or, more likely, the writing system is being used as a proxy to differentiate between mainland Chinese Mandarin and Taiwanese Mandarin (not to be confused with Taiwanese, a standard form of Southern Min). One must be careful with this assumption, however, as Simplified Chinese text isn’t necessarily mainland Chinese Mandarin, and Traditional Chinese text isn’t necessarily Taiwanese Mandarin.
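One way to avoid the proxy is to record language, script, and region as separate fields, which is what BCP 47 language tags such as "cmn-Hans-CN" or "yue-Hant-HK" do. The sketch below is a deliberately simplified parser of my own (the Locale type and parse_tag function are hypothetical names). It handles only the language, script, and region subtags and ignores everything else that full BCP 47 allows. Whether a given platform accepts these tags is a separate question.

```python
from typing import NamedTuple, Optional


class Locale(NamedTuple):
    language: str           # e.g. "zh", "cmn", "yue", "nan"
    script: Optional[str]   # e.g. "Hans" (Simplified), "Hant" (Traditional)
    region: Optional[str]   # e.g. "CN", "TW", "HK"


def parse_tag(tag: str) -> Locale:
    """Parse a simplified language[-script][-region] tag.

    Assumption: extlang, variant, and extension subtags are ignored,
    so this is not a full BCP 47 parser.
    """
    parts = tag.replace("_", "-").split("-")
    language = parts[0].lower()
    script = None
    region = None
    for part in parts[1:]:
        if len(part) == 4 and part.isalpha():
            # Four letters: a script subtag, conventionally title-cased.
            script = part.title()
        elif (len(part) == 2 and part.isalpha()) or (len(part) == 3 and part.isdigit()):
            # Two letters or three digits: a region subtag.
            region = part.upper()
    return Locale(language, script, region)
```

With the fields separated, "cmn-Hant-TW" and "yue-Hant-HK" share a script but not a language, and "cmn-Hans-CN" and "cmn-Hant-TW" share a language but not a script.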
2.3 Mixed Language Systems
Mandarin and (historically) Literary Chinese have the most well-established and widely used written forms, but Cantonese also has a written form standardized in Hong Kong, and Taiwanese and Hakka have written forms standardized in Taiwan. These are mainly used in contexts that require precise transcription of spoken words such as legal records, or in informal contexts, such as on the Internet. Most formal writing is done in Mandarin instead.
Educated speakers of any Sinitic language are able to read and write in Mandarin, even if they don’t speak it. It is common in some Sinitic languages (most notably Cantonese) to read Mandarin texts aloud using the individual character pronunciations from the reader’s language [source]. This is comparable to an English speaker knowing how to read and write Spanish but not knowing how to pronounce it, and thus reading Spanish aloud by looking at the spelling and reading letters as if they were the same as in English. Compare the anglicized pronunciation of “Los Ángeles” with its actual Spanish pronunciation.
The result is a somewhat bastardized mixed language form where the grammar and vocabulary are from one language but the pronunciation system is from another language. With English and Spanish, this is usually unacceptable (with exceptions such as the one mentioned above), but for Sinitic languages it has been conventionalized, leading to the misconception that the Sinitic languages share a single written language, when in fact the language being written is usually Mandarin. Here is an example of written language usage:
| Mandarin | 「不要動！Bú yào dòng!」 |
| Mandarin with Cantonese pronunciations | 「不要動！Bat1 jiu3 dung6!」 |
| Mandarin with Taiwanese pronunciations (unusual) | 「不要動！Put iàu tōng!」 |
| Native Cantonese | 「唔好郁！M4 hou2 juk1!」 |
| Native Taiwanese | 「莫振動！Mài tín-tāng!」 |
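The first three rows of the table can be modeled as one written text paired with interchangeable pronunciation tables. The sketch below is illustrative only. The READINGS data contains just the three characters from the example, with the readings exactly as given in the table, and read_aloud is a hypothetical helper. A real system would need context-dependent readings rather than one fixed reading per character.

```python
# One reading per character per language, copied from the table above.
# Keys are ISO 639-3 codes: cmn = Mandarin, yue = Yue (Cantonese),
# nan = Southern Min (Taiwanese).
READINGS = {
    "cmn": {"不": "bú", "要": "yào", "動": "dòng"},
    "yue": {"不": "bat1", "要": "jiu3", "動": "dung6"},
    "nan": {"不": "put", "要": "iàu", "動": "tōng"},
}


def read_aloud(text: str, language: str) -> str:
    """Read written text using the character readings of the given language.

    Characters without an entry (such as punctuation) are skipped.
    """
    table = READINGS[language]
    return " ".join(table[ch] for ch in text if ch in table)
```

Calling read_aloud("不要動！", "yue") yields "bat1 jiu3 dung6". The grammar and vocabulary of the text stay Mandarin, and only the pronunciation table changes. The native Cantonese and Taiwanese rows cannot be produced this way, because they use different words.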
Such a mixed language system is in fact necessary when reading Literary Chinese, which is an archaic language with a distinct vocabulary and grammar but no known pronunciation system. Thus, to read Literary Chinese aloud, one must borrow the pronunciation system of an extant language: either a Sinitic language or another language that assigns pronunciations to Chinese characters, such as Japanese or Korean [source]. Of course, the resulting speech wouldn’t make sense to speakers of that language without prior study of Literary Chinese.
Chinese isn’t a single language, and non-Mandarin issues such as literary pronunciations and mixed language systems should not be overlooked. Variations in Chinese characters as a writing system go beyond simplified vs. traditional, as many characters also have orthodox, shinjitai, and other variants that have to be used in specific contexts. Ken Lunde’s book CJKV Information Processing is a useful resource on how computers represent and handle CJKV (Chinese, Japanese, Korean, and Vietnamese) character variants. If there’s one thing that I hope those doing language processing or i18n can take away from this article, it’s that addressing “Chinese (Simplified)” and “Chinese (Traditional)” is not addressing all Sinitic languages.