Posted 3 years, 8 months ago at 02:46.
I landed a patch last week that represents the culmination of several months’ worth of effort on my part, and that also brings to fruition the full promise of l10n nightly updates, work that was begun by Armen almost a year ago.
History and motivation
For almost as long as there has been a Firefox browser, there has been a nightly update channel whereby bleeding edge developers could keep their browser up-to-date with the latest code changes from the past day. For lay people, imagine the security update notices you receive every month or so, but happening on a daily basis. There are changes made every day in the Mozilla codebase — bug fixes, new features, interface changes — so it’s important for us to have a vector to get these changes into testers’ hands as quickly as possible. Since the people that use these “nightly” builds are often developers or involved community members, the likelihood of us getting feedback on any code changes introduced in these builds is relatively high.
To minimize the download size of each nightly update for our testers, we offer what we call a “partial” update, where we provide only the accumulated binary code differences between the new nightly build and the previous one. If testers are more than one nightly update behind, we offer them a “complete” update, which is essentially just an archive of the latest nightly build packaged in a format that can be understood by the browser’s updating framework.
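As a rough illustration of the partial-versus-complete rule described above, here is a minimal sketch. The function name `choose_update_type` and the idea of counting how many nightlies a tester is behind are assumptions made for illustration only; they are not taken from the actual update framework.

```python
from typing import Optional


def choose_update_type(nightlies_behind: int) -> Optional[str]:
    """Pick the kind of update to offer (hypothetical helper).

    nightlies_behind: how many nightly builds the tester's browser lags
    behind the latest one (0 means already up to date).
    """
    if nightlies_behind < 0:
        raise ValueError("nightlies_behind cannot be negative")
    if nightlies_behind == 0:
        return None        # nothing to offer
    if nightlies_behind == 1:
        return "partial"   # binary diff against the previous nightly
    return "complete"      # full archive of the latest nightly


for behind in (0, 1, 3):
    print(behind, choose_update_type(behind))
```

The point of the split is simply that a diff is only useful against the one build it was generated from; anyone further behind needs the whole archive.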
The developers and community members who use these nightly builds and updates don’t necessarily live in the U.S. or speak English as their first language, however. Armen took the first huge step towards better serving these users last year when he added nightly update support for localized builds (we call these builds l10n for short) on our core development branch, mozilla-central. For the first time ever, non-English users could get the same bleeding edge code as English speakers, but presented in their own language.
Useful, but at what cost
Mozilla has many different active code branches, all of which produce nightly builds for testing. Once l10n nightly updates had proven their utility on mozilla-central, we wanted to extend the same functionality to our other major branches, specifically the 1.9.2 and 1.9.1 code lines. However, after turning on nightly updates for the 1.9.2 branch, we soon discovered a bottleneck in our update process.
Since we first began offering nightly updates, all of them had been generated on a single Linux machine (or more recently, a single Linux VM). This was fine when there was only one locale to worry about, English (en-US). The update for each nightly build took between 2 and 3 minutes to generate, so even with 4 code branches (mozilla-central, 1.9.2, 1.9.1, CVS trunk) and 3 platforms (Linux, Mac OS X, Windows), updates would only ever take about 30 minutes to generate in total, and they would rarely be blocked because builds never finish at exactly the same time. Adding localized builds into the mix threw everything into disarray.
Mozilla prides itself on its localization story, but the sheer number of available localizations (around 75 per code branch) was swamping our poor, little update generation VM. As a rough estimate:
2.5 minutes/update x 3 update platforms x 75 locales
= 562.5 minutes
= >9 hours to generate all the nightly updates for one code branch
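The same back-of-the-envelope estimate can be written as a short script, which makes it easy to see how the total scales with locales and branches. This is only a sketch: the helper name `serial_minutes` is made up for illustration, and 2.5 minutes is the midpoint of the 2 to 3 minute range quoted earlier.

```python
MINUTES_PER_UPDATE = 2.5  # midpoint of the quoted 2-3 minute range
PLATFORMS = 3             # Linux, Mac OS X, Windows
LOCALES = 75              # approximate locale count per branch


def serial_minutes(branches, locales,
                   platforms=PLATFORMS,
                   minutes_per_update=MINUTES_PER_UPDATE):
    """Total minutes if every update is generated one after another."""
    return minutes_per_update * platforms * locales * branches


en_us_only = serial_minutes(branches=4, locales=1)
one_branch = serial_minutes(branches=1, locales=LOCALES)
two_branches = serial_minutes(branches=2, locales=LOCALES)

print(f"en-US only, 4 branches: {en_us_only} minutes")
print(f"l10n, 1 branch: {one_branch} minutes ({one_branch / 60:.2f} hours)")
print(f"l10n, 2 branches: {two_branches} minutes ({two_branches / 60:.2f} hours)")
```

Because every factor multiplies the others, each extra branch adds another nine-plus hours of serial work on a single machine.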
When we turned on the l10n nightly updates for the 1.9.2 branch, we had to make a number of changes immediately.
First, we certainly couldn’t consider turning on updates for another branch, or nightly update generation would simply never catch up with itself. Because it would take over 18 hours to generate all the required updates for just two branches, we prioritized the English updates (en-US) so as not to degrade the existing nightly update experience for the consumers of English builds.
At this point, nightly updates for some locales were not timely at all, e.g. the Windows nightly build for the Chinese Traditional, Taiwan (zh-TW) locale would wait in the queue over 12 hours every day before having its partial update generated. In the timezones where a zh-TW localized build would most likely be used, this meant the base code of the “updated” build was always out-of-date by an entire day.
Adding a new platform, or even a new locale, was painful in a multiplicative rather than an additive way. Also, if we ever needed to re-spin a set of nightlies (which implies a new round of nightly updates), it meant that nightly update generation wouldn’t catch up for over 24 hours.
Make it right
Under these conditions, the decision to parallelize update generation was easy; the only question was how. We did consider simply adding more Linux VMs to generate updates, but that would have involved writing code to manage the pending update queue across multiple machines, or at the very least introducing a dedicated update generation machine per platform. The more straightforward solution was to use the massive parallelization already provided by our existing build slave pool.
The first hurdle in generating our nightly updates on the build slaves was ensuring that the MAR (Mozilla ARchive) and binary diff tools worked on platforms other than Linux (which they now do) and were built automatically for branches on which we wanted to provide nightly updates (which they are now). There were also a bunch of build system changes that needed to happen to get MAR files uploaded automatically from the build slaves. After that, it was testing, lots and *lots* of testing. Last week, we finally got out of the testing phase and were able to deploy the changes for mozilla-central.
There are always trade-offs when making a change like this, but honestly this is a pretty big win. Here are some typical l10n build scenarios from last week compared to the previous week before the updates-on-slave changes were in place:
|Platform||Last Build Finished (before)|Last Build Finished (now)|Time added per locale (minutes)|Update available (before)|Update available (now)|
|Linux||03:44||04:22||1.5 – 5||18:05||04:18|
|Mac OS X||06:10||06:24||2.5 – 4.5||16:55||06:24|
|Windows||05:45||06:58||3 – 4||19:50||06:58|
In aggregate, we’ve added a maximum of 5 minutes/locale (often much less) x ~75 locales, or 375 minutes, to the total time required to build all locales for a given branch on a given platform. Given 8 build slaves per platform assigned to create l10n nightly repacks, this means that the final l10n nightly repack on any build slave will finish roughly 45 minutes later than it did before the change. The parallelization of the overall process more than compensates, though, at least as far as nightly updates are concerned. For example, the zh-TW localized Windows build may finish at 07:00 PDT now instead of 06:15 PDT, but its nightly update is also immediately available at 07:00 PDT. English (en-US) builds also have their updates generated on the slaves, but since their updates were prioritized before anyway, they continue to be timely.
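For anyone who wants to check the overhead figure, the arithmetic can be sketched as follows. The function names are hypothetical, and the even split of locales across slaves is a simplifying assumption; real repacks won't divide up quite so neatly.

```python
MAX_ADDED_MINUTES_PER_LOCALE = 5  # worst case quoted above
LOCALES = 75                      # approximate locale count
SLAVES_PER_PLATFORM = 8           # slaves assigned to l10n repacks


def added_minutes_total(locales=LOCALES,
                        per_locale=MAX_ADDED_MINUTES_PER_LOCALE):
    """Worst-case extra build time summed over all locales."""
    return locales * per_locale


def added_minutes_per_slave(slaves=SLAVES_PER_PLATFORM):
    """Extra wall-clock time per slave, assuming an even split of work."""
    return added_minutes_total() / slaves


print(added_minutes_total())      # 375
print(added_minutes_per_slave())  # 46.875
```

The worst case comes out just under 47 minutes per slave, which lines up with the "roughly 45 minutes" estimate, and since the per-locale cost is often much less than 5 minutes the typical delay should be smaller.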
The localized builds also need to build the binary diff tool (bsdiff), but we re-use the same checkout for the tools from one l10n build to the next, so in the absence of a clobber we only ever incur that build cost once. Even when we do need to rebuild bsdiff, we re-use the existing configuration steps and make targets that have already been run earlier in the l10n build process, so the maximum time we ever spend building the bsdiff tool is about 1 minute.
Perhaps what’s most exciting about this change are the possibilities it opens up for us, and for others.
Nick is already busy extending this work to provide nightly updates for project branches. Project branches are a relatively new thing in the Mozilla world, and soon the developers working on them will be able to stay current code-wise just the same as if they were working on mozilla-central. KaiRo also tells me that he’s only a few days away from having nightly partial updates available for SeaMonkey for the first time ever!