As wordlingo.com explains, localization (the US spelling seems to be dominant across varieties of English) is
[t]he process of adapting text and cultural content to specific target audiences in specific locations. The process of localization is much broader than just the linguistic process of translation. Cultural, content and technical issues must also be taken into account.
Ever since I started lending a hand to make the WordPress blogging software usable for multilingual blogs, I have been running into the difficulties of this process.
Internationalising a blog is not the same as localizing software, though. I have written more on this on the palimpsest wiki.
A commonly used tool in the world of free software is gettext. Its approach (extract the strings in the language the software was originally written in, then substitute text in the target languages) sounds reasonable and straightforward. Until you try to use it, that is. Via LaughingMeme, I found a detailed account of the shortcomings of gettext by Sean M. Burke and Jordan Lachler: a “localization horror story” about the simple task of translating the program alerts “I scanned N directories” and “Your query matched N files in M directories” into Arabic, Italian, Chinese and Russian. Sounds easy? Not so fast …
The Chinese guy replies with the one phrase that [all variations of the second sentence] translate to in Chinese, and that phrase has two “%g”s in it, as it should — but there’s a problem. He translates it word-for-word back: “In %g directories contains %g files match your query.” The %g slots are in an order reverse to what they are in English. You wonder how you’ll get gettext to handle that.
But you put it aside for the moment, and optimistically hope that the other translators won’t have this problem, and that their languages will be better behaved — i.e., that they will be just like English.
But the Arabic translator is the next to write back. First off, your code for “I scanned %g directory.” or “I scanned %g directories.” assumes there’s only singular or plural. But, to use linguistic jargon again, Arabic has grammatical number, like English (but unlike Chinese), but it’s a three-term category: singular, dual, and plural. In other words, the way you say “directory” depends on whether there’s one directory, or two of them, or more than two of them. Your test of ($directory == 1) no longer does the job. And it means that where English’s grammatical category of number necessitates only the two permutations of the first sentence based on “directory [singular]” and “directories [plural]”, Arabic has three — and, worse, in the second sentence (”Your query matched %g file in %g directory.”), where English has four, Arabic has nine. You sense an unwelcome, exponential trend taking shape.
Your Italian translator emails you back and says that “I searched 0 directories” (a possible English output of your program) is stilted, and if you think that’s fine English, that’s your problem, but that just will not do in the language of Dante. He insists that where $directory_count is 0, your program should produce the Italian text for “I didn’t scan any directories.”. And ditto for “I didn’t match any files in any directories”, although he says the last part about “in any directories” should probably just be left off. […]
Then your Russian translator calls on the phone, to personally tell you the bad news about how really unpleasant your life is about to become:
Russian, like German or Latin, is an inflectional language; that is, nouns and adjectives have to take endings that depend on their case (i.e., nominative, accusative, genitive, etc…) — which is roughly a matter of what role they have in syntax of the sentence — as well as on the grammatical gender (i.e., masculine, feminine, neuter) and number (i.e., singular or plural) of the noun, as well as on the declension class of the noun. But unlike with most other inflected languages, putting a number-phrase (like “ten” or “forty-three”, or their Arabic numeral equivalents) in front of noun in Russian can change the case and number that noun is, and therefore the endings you have to put on it.
He elaborates: In “I scanned %g directories”, you’d expect “directories” to be in the accusative case (since it is the direct object in the sentence) and the plural number, except where $directory_count is 1, then you’d expect the singular, of course. Just like Latin or German. But! Where $directory_count %10 is 1 (”%” for modulo, remember), assuming $directory_count is an integer, and except where $directory_count %100 is 11, “directories” is forced to become grammatically singular, which means it gets the ending for the accusative singular… You begin to visualize the code it’d take to test for the problem so far, and still work for Chinese and Arabic and Italian, and how many gettext items that’d take, but he keeps going… But where $directory_count %10 is 2, 3, or 4 (except where $directory_count %100 is 12, 13, or 14), the word for “directories” is forced to be genitive singular — which means another ending…
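To make the translators' complaints a bit more concrete, here is a minimal Python sketch of what they are asking for. It is not how gettext or the authors' own solution works. The function names and the category labels ("one", "two", "few", "other") are my own, the Arabic rule is the simplified three-way split described in the quote, and the Russian rule covers only the cases the quote spells out, with everything else lumped into "other". English glosses stand in for real translations.

```python
from typing import Callable, Dict


def english_category(n: int) -> str:
    # English: singular for exactly 1, plural otherwise (including 0).
    return "one" if n == 1 else "other"


def arabic_category(n: int) -> str:
    # Simplified three-term split from the quote: singular, dual, plural.
    if n == 1:
        return "one"
    if n == 2:
        return "two"
    return "other"


def russian_category(n: int) -> str:
    # Rule as quoted: the last digit decides, except for the "teens".
    if n % 10 == 1 and n % 100 != 11:
        return "one"    # accusative singular ending
    if n % 10 in (2, 3, 4) and n % 100 not in (12, 13, 14):
        return "few"    # genitive singular ending
    return "other"      # every remaining count (assumption: one catch-all form)


CATEGORY: Dict[str, Callable[[int], str]] = {
    "en": english_category,
    "ar": arabic_category,
    "ru": russian_category,
}

# Placeholder templates: English glosses, not real translations.
SCANNED: Dict[str, Dict[str, str]] = {
    "en": {"one": "I scanned {n} directory.",
           "other": "I scanned {n} directories."},
    "ar": {"one": "I scanned {n} directory [singular].",
           "two": "I scanned {n} directories [dual].",
           "other": "I scanned {n} directories [plural]."},
    "ru": {"one": "I scanned {n} directory [acc. sg.].",
           "few": "I scanned {n} directory [gen. sg.].",
           "other": "I scanned {n} directories [other form]."},
}

# Named placeholders let each language put the numbers in its own order.
MATCHED: Dict[str, str] = {
    "en": "Your query matched {files} files in {dirs} directories.",
    "zh": "In {dirs} directories contains {files} files match your query.",
}


def scanned(lang: str, n: int) -> str:
    """Pick the template for this language's plural category and fill it in."""
    return SCANNED[lang][CATEGORY[lang](n)].format(n=n)


def matched(lang: str, files: int, dirs: int) -> str:
    """Fill in both counts; the template decides where each one goes."""
    return MATCHED[lang].format(files=files, dirs=dirs)
```

With this, `scanned("ru", 21)` picks the accusative singular template and `scanned("ru", 11)` does not, which is exactly the modulo rule the Russian translator describes. The Italian request (a different sentence altogether for zero) would need one more category, and the two-number sentence multiplies the templates per language, which is the exponential trend the story warns about.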
This said, for translating single words or text without variables, especially in a short script, gettext is perfectly adequate. But there’s another problem: blogs, while technically software (PHP scripts, in our case), face different problems from desktop utilities and the like. The text to be translated needs to be user-editable. Every blog is different, and bloggers will want the text, any bit of text, to appear just the way they prefer it. For the moment, that is quite difficult to achieve on a multilingual blog.
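One way to picture the "user-editable" requirement is a layer of blogger-supplied strings sitting on top of the strings shipped with the software. The following is only a sketch in Python, not how WordPress or gettext handles it. The string keys, the sample texts and the `strings_for` helper are all made up for illustration.

```python
from collections import ChainMap
from typing import Dict, Mapping

# Strings shipped with the (hypothetical) blog software, per language.
SHIPPED: Dict[str, Dict[str, str]] = {
    "en": {"comments_link": "Comments", "read_more": "Read more"},
    "de": {"comments_link": "Kommentare", "read_more": "Weiterlesen"},
}

# Strings the blog owner has edited; these win over the shipped ones.
USER_OVERRIDES: Dict[str, Dict[str, str]] = {
    "de": {"read_more": "Mehr dazu"},
}


def strings_for(lang: str, fallback: str = "en") -> Mapping[str, str]:
    """Look up strings in order: user override, shipped translation, fallback language."""
    return ChainMap(
        USER_OVERRIDES.get(lang, {}),
        SHIPPED.get(lang, {}),
        SHIPPED[fallback],
    )
```

Here `strings_for("de")["read_more"]` returns the blogger's own wording, while a language with no catalogue at all falls back to the English strings.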
Which reminds me once again how regrettable it is that written communication works best in one language at a time. Spoken communication is much more flexible in this regard. (Discussions on IRC or other public chat channels are one exception: I’ve often found it useful to carry on two separate conversations with the same interlocutors in two different languages; it’s easier to keep the conversations apart this way.)