XULRunner in large projects, part 4: Localization pitfalls

I am back from the Mozilla Summit and somewhat managed to process all the new information I got there. But instead of posting yet another summit summary or more summit photos (what, you didn’t know how great this summit was?) I have a far more boring topic for today: localization of XULRunner-based applications.

I mean, what is there to say about localization? It is really very simple. Some magic in the chrome:// protocol makes sure that whenever a file in the locale “subdirectory” is accessed one of the available locales is selected and the file is loaded from there. This automatic selection mechanism works very well and will select the locale that is closest to the value of the general.useragent.locale preference.
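To make "closest" a bit more concrete, here is a minimal Python sketch of such a selection. It is not the actual chrome registry code: the function name `pick_locale` and the matching order (exact match, then same language, then a default) are my assumptions for illustration.

```python
def pick_locale(requested, available, default="en-US"):
    """Return the available locale closest to the requested one.

    Hypothetical approximation of the selection; the real chrome
    registry may apply different rules.
    """
    by_lower = {loc.lower(): loc for loc in available}

    # 1. Exact match, ignoring case.
    if requested.lower() in by_lower:
        return by_lower[requested.lower()]

    # 2. Any locale sharing the language part ("de-AT" matches "de-DE").
    language = requested.lower().split("-")[0]
    for loc in sorted(available):
        if loc.lower().split("-")[0] == language:
            return loc

    # 3. Fall back to the default locale, or to whatever is installed.
    if default in available:
        return default
    return sorted(available)[0] if available else None
```

With `["en-US", "de-DE", "fr"]` installed, a request for `de-AT` would end up with `de-DE` in this sketch, and a request for `ja` would fall back to `en-US`.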

File formats

A typical locale contains files of two types. The DTD file format is part of the XML specification and can be used with any XML file (which includes XUL and XHTML files). The idea is to associate XML entities in DTD files with localized strings; the XUL document then only references the entities. This is a rather unorthodox use of DTD files, but the approach has the clear advantage of not requiring any special handling: the browser simply processes the XML file as it usually would. The downside is that the DTD format requires a significant amount of boilerplate and leaves much room for mistakes. And any mistake in a DTD file (missing entity definition, syntax error, invalid character, Byte Order Mark) results in a fatal error: the entire XUL file is rejected with a parsing error. The other issue is that including multiple DTD files in a XUL file is complicated and rather counterintuitive.
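The all-or-nothing behavior is easy to reproduce with any XML parser. The following Python sketch uses the standard library parser rather than Gecko, and it inlines the entity definition into the document instead of loading it from a separate locale file, so it only approximates what happens in XULRunner. The entity name `greeting.label` is made up.

```python
import xml.etree.ElementTree as ET

# The DTD part is inlined to keep the snippet self-contained; a real
# XUL file would reference a chrome://.../locale/... DTD file instead.
GOOD = """<!DOCTYPE window [
  <!ENTITY greeting.label "Hello">
]>
<window><label value="&greeting.label;"/></window>"""

# Same document, but the referenced entity is not defined anywhere.
BROKEN = GOOD.replace("&greeting.label;", "&missing.label;")


def label_value(xml_text):
    """Return the value attribute of the label after entity expansion."""
    root = ET.fromstring(xml_text)
    return root.find("label").get("value")


def parses(xml_text):
    """Tell whether the document can be parsed at all."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False
```

Here `label_value(GOOD)` yields the localized string, while `parses(BROKEN)` is false: a single undefined entity makes the parser reject the whole document, not just the one label.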

Of course, localized data isn’t only used in XUL and XHTML; JavaScript code often needs some localized strings as well. There is no good way to access DTD files from JavaScript, however, so the properties file format is used instead. This is a minimalistic format originating from Java which simply lists key/value pairs. It can be accessed either via the stringbundle tag or the nsIStringBundle interface. Unfortunately, the method names differ depending on which one you use (getString() on the tag, GetStringFromName() on the interface), which certainly doesn’t help code consistency or readability. On the bright side, syntax errors are impossible by definition and the only problem you could run into is a missing string: retrieving that string will throw an exception.
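For illustration, a deliberately simplified Python reader for this format could look as follows. It is a sketch under assumptions: escape sequences and line continuations are ignored, and `parse_properties` is a name I made up, not part of any Mozilla tool.

```python
def parse_properties(text):
    """Parse key=value lines into a dict (simplified sketch).

    Blank lines and comment lines starting with '#' or '!' are skipped.
    Lines without '=' are silently dropped, which mirrors why syntax
    errors cannot happen: anything unrecognized is simply ignored.
    """
    strings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line[0] in "#!":
            continue
        key, separator, value = line.partition("=")
        if separator:
            strings[key.strip()] = value.strip()
    return strings
```

Looking up a key that isn’t in the resulting dict raises a `KeyError`, which is roughly the situation you get with a missing string in a string bundle.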

Ensuring working localizations

So at the moment the historically grown localization landscape in XULRunner is somewhat inconsistent. But this inconsistency is merely a minor annoyance and something that the L20n efforts will hopefully make go away soon. Fatal errors caused by localization mistakes, however, are significantly more problematic; they haunted TomTom HOME, for example, quite regularly during the early phases of the project. It turns out that you cannot really trust localizers to deliver DTD files that use the correct encoding, have no BOM and are free of syntax errors. Given that localized application versions typically get less testing, these mistakes would sometimes go unnoticed. And you simply cannot guarantee that all strings are translated in all locales, particularly not in the middle of a development cycle. But it would be nice to always have usable localized builds.

So, what you need for working localizations:

  • Validation: Ensure that the localization files use UTF-8 encoding without BOM and check syntax (makes sense even for properties files — any “trash” that will be ignored by the browser indicates an issue). Ideally the tools used by localizers to create the translated files already ensure valid format, otherwise scripts will need to be used for this job.
  • Completeness: Locales have to be compared against the base locale to find missing or unnecessary strings. Ideally, the scripts used here will also add missing strings from the base locale to prevent errors in the build (arguably, this fallback behavior should be implemented in XULRunner, yet it isn’t).
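Both checks are simple enough to sketch in a few lines of Python. This is not the script I wrote for TomTom HOME and not l10n-checks either; the function names are made up, and the locales are assumed to be already parsed into dicts mapping string IDs to values.

```python
import codecs


def check_encoding(raw):
    """Return a list of encoding problems found in the raw file bytes."""
    problems = []
    if raw.startswith(codecs.BOM_UTF8):
        problems.append("starts with a UTF-8 byte order mark")
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError as error:
        problems.append("not valid UTF-8 at byte %d" % error.start)
    return problems


def compare_locale(base, locale):
    """Compare a locale against the base locale.

    Returns the missing IDs, the unnecessary IDs and a merged dict in
    which every missing string is filled in from the base locale.
    """
    missing = sorted(set(base) - set(locale))
    unnecessary = sorted(set(locale) - set(base))
    merged = {key: locale.get(key, base[key]) for key in base}
    return missing, unnecessary, merged
```

Packaging the merged dict instead of the translator’s original file means a missing string shows up in the base language rather than breaking the build, and the two lists can go back to the translator as a report.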

Mozilla apparently has a set of scripts called l10n-checks to do this job. Unfortunately, I am not familiar with it and cannot say whether it is a complete solution for the problems above. Documentation doesn’t really make it clear either. For TomTom HOME I had to write custom scripts and Songbird also uses its custom solution from what I can tell (I didn’t look too closely though).

Getting good localizations

But wait, a working localization doesn’t necessarily mean a good localization — it might contain pretty crappy translations. And finding good translators is only one step towards good localization. Some of the other steps are:

  • Find a good translation environment for translators to use. Mozilla uses narro and Verbatim, I don’t know much about the merits of either unfortunately.
  • Make sure to provide translators with some context about the strings they are translating. This means first of all having developers choose meaningful string IDs that describe the function of a string rather than its value. And it also means adding comments to explain how a string is used if it isn’t obvious.
  • If the space for a particular string is limited this should be communicated to translators. Remember that English is a very compact language, translations will often be significantly longer. Oh, and no: telling translators about the size constraints doesn’t mean that testers no longer need to check whether any localized strings are cut off or make the layout look bad.
  • Avoid inserting numbers or words dynamically into a sentence, use different static variants of the same sentence if possible. Building together a sentence dynamically might work well in English but will usually get very complicated in other languages (at least if you want to get a result that sounds somewhat correct). L20n is meant to address this issue though I have my doubts here.

XULRunner locales

Once you’ve done your homework and got great localizations for your application you might notice an issue: some strings are not localized, e.g. labels of default alert dialog buttons, the entire add-on manager or error console UI, some error messages. Yes, these strings are not part of your application, they are part of XULRunner. The good news: XULRunner locales are all there, you can get them. The bad news: XULRunner locales aren’t exactly small, around 150 kB (compressed) or more. If you played with the idea of putting all the available locales of your application into one installation package this is quite a setback — already including 20 XULRunner locales will increase the download size by 3 MB.

So, what are the options:

  • Do not offer installation packages with multiple locales, that’s what Firefox does. The disadvantage: the user has to decide on a language before download and cannot change his mind afterwards.
  • Download additional locales automatically when the user selects a different locale. I am not aware of any application that chose this approach, probably because even building all the required XULRunner locales is rather complicated.
  • Discover that there are only a few places where XULRunner strings “shine through” and replace these with your own UI. That’s the approach that TomTom HOME followed pretty consistently (which was a pain for developers) and Songbird less consistently (which is probably a pain for users).

Do you know a perfect solution? I don’t.

Comments

  • johnjbarton

    Firebug project does not have a perfect solution either, but we’ve made some progress. We removed all DTD files for the reasons you outlined. We have our own localization functions, one that takes a list of element ids to translate the static UI, and two that translate strings added dynamically (one for simple string substitution and one for parameters). The translation falls back to en-US if the locale fails to have an entry. The property files are uploaded to BabelZilla where translators do their magic.

    This scheme gives two fallbacks: if the locale entry is missing, the en-US value is used; if the en-US value is missing the label is still readable and usable. Failures are not catastrophic and it’s easy to notice a missing translation. These are all properties important for a small project like ours.

    Our next step will be to allow users to opt out of the translation even if they use a localized Firefox. This is one of our most requested features (perhaps because development tools generally assume English readers).

  • anonymous

    “Avoid inserting numbers or words dynamically into a sentence, use different static variants of the same sentence if possible.”

    not sure I understand what you mean by that

    I thought in some cases you had used in ABP one full sentence with placeholders for numbers e.g. # out of #

    Wladimir Palant

    “# out of #” is actually unproblematic, it isn’t a full sentence. But “# filter subscription(s) and # custom filter(s) in use” is pretty ugly when translated and I am not very happy about that. As I said, “if possible” – in this particular case I couldn’t see another solution that would still be compact.

  • What about BabelZilla ?

    Wladimir Palant

    For XULRunner applications? Unusable IMO. Frankly, it has enough issues of its own that even translating extensions with it is pretty painful.

  • Goofy

    Very interesting blogpost, I am glad to see that sometimes a real developer has such insight into localization problems and takes them seriously into account, probably because he unfortunately runs into these issues himself.
    I also have hopes and scepticism about L20n project, which is not surfacing as quickly as expected (?).

    I agree with most recommendations in the Getting good localizations § except :

    * Though not really satisfying as a solution, I don’t think various placeholders in a string may annoy experienced translators.

    As for the user downloading any other language than the app useragent locale, this feature is available in Songbird but requires restart.

    About BabelZilla : sure there may be bugs and various annoyances, but it is a bit an overstatement to describe extension translation as painful :D.
    Serious developers are welcome to help and make our process better.
    As for xulrunners, considering the locale structure is just as classical as in an extension, babelzilla is just as usable, or as problematic if you want :P
    (translating langpacks for songbird, komodo, spicebird proved not to be such a big challenge)

  • Axel Hecht

    Hi Wladimir, nice post.

    A few comments: You’re using “Mozilla does…” quite generously. l10n-checks for example seem to be a project by Adrian Kalla, probably related to the l10n checks that AMO does. Sadly, I don’t know their details either. We’re not using verbatim for app l10n, just for plain PO web content/web apps. narro is in use by localizers for the apps, but community-maintained.

    I would think that you’re overestimating the issues around getting xulrunner langpacks, it depends mostly on whether your app supports add-ons. If it does, you can use langpacks rather nicely, and creating them for your apps shouldn’t be that hard to reverse engineer from fx these days. Not trivial or shrink-wrapped, granted, but as said in other posts of yours, xulrunner ain’t that anyway ;-). Executive summary, your own files plus libs-ab-CD in toolkit/locales, plus some decent packaging in a jar.

    I would also like to stress that we’ll likely continue to put quite a bit of burden on the build process for l10n data, DTD/properties or l20n alike. The tests to be run are neither simple, nor is the amount of data involved small, so anything that can be statically checked at build time should be. If nothing else than for performance reasons.

    I have another patch on hold that’ll make the compare-locales checks stronger still, and make l10n-merge fix found errors (by replacing them with en-US, plurals-badness notwithstanding). I need to pick that up again.

    Another interesting aspect is that Adrian is working on a project to run l10n-focused mozmill tests, in particular we have code in stock that will look for cropped text and create screenshots of those problems. If your xulrunner app works with mozmill, you might be able to re-use some of that.

  • Mook

    Songbird does indeed do the automatic-download thing; it’s not perfect, though, since it makes offline installs show up in en-US only. I think one of our hardware partners (i.e. who actually ship media to end users) include a few key languages in their installer since the size pressure is smaller there. If we had more time for this stuff, I’d have liked to ship locale-specific installers as well; alas…

    I believe we actually grab the Firefox release locale xpis and slip it into the Songbird locale extension (with a rewritten manifest of course).

    Our actual setup for l10n is… well, almost workable, but not quite perfect. Localizers want things like actual files to check in/out (so their existing tools can work with them better, and the ability to work offline, etc.), and the locale packaging can still have hiccups. Sadly yet another bit that needs more time. :(

    Sadly, I suspect l20n won’t be able to make the Gecko 2.0 (Firefox 4) train… The version switch ended up being rather sudden and there isn’t enough time to switch the app over, even if the engine itself was completed in time.

  • MarkC

    The situation is even worse for remote XUL apps: bug #22942 means that DTDs have to be injected into the page directly if you use them, and I wasn’t able to get string bundles to work for remote XUL.

    Currently we inject our DTD, and use a separate JS function to download and parse the DTD separately if we need to use any strings within our JS code.

    Injecting our monolithic DTD is too heavy – our DTD content is bigger than the whole of the rest of some apps. If #22942 were to get fixed then this would be less of an issue, as the DTD would usually be cached, but instead we’re having to look at re-writing our server-side code to only inject the entities needed in a particular app.

    Wladimir Palant

    Actually, there is a bug on disabling remote XUL altogether for security reasons. Things were going that way, supporting remote XUL requires tons of effort and Mozilla has no reason to invest into it. Quite the opposite – pushing a proprietary technology to the web goes against Mozilla’s goals.

  • MarkC

    Yeah, I know about that bug – and I’m dreading the day the change actually lands.

    I agree with the idea in principle, but open web technologies aren’t mature enough to easily replace what we do with remote XUL right now. Things are moving in the right direction, but we’re not there yet. Our software is specifically designed for intranet use (like most remote XUL apps, I suspect) where the use of open formats is less of a concern than on the web at large.

    I just hope that bug lands far enough in the future for us to have replaced most of our remote XUL widgets with XBL2 and HTML. We’ll still have to roll our own alternatives to overlays and templates though, as I don’t see any open equivalent of those coming anytime soon.

    Wladimir Palant

    It seems to be official that this change will land for Firefox 4 – and it is supposedly a necessary step towards supporting XBL2.