The roll-out of nightly l10n updates has been…bumpy. The primary user-visible symptom has been that nightly updates for en-US are sometimes delayed by many hours compared to when they used to be generated.
I hesitate to say that these consequences were unforeseen; rather, we were initially unsure how, or even whether, the various systems we use to generate, store, and serve updates would cope once they had to deal with more than one locale.
We started by setting up a nightly update system in our release engineering staging environment, something that was long overdue anyway. Unfortunately, while this allowed us to test the functionality of our code changes, it didn't let us test at anywhere near the scale we run at in production. The staging update server easily kept up with the trickle of new builds and MARs coming in from a small pool of staging slaves.
Of course, when we moved from staging to production, having a single virtual machine (VM) producing updates for all locales across all branches quickly proved to be a formidable bottleneck. At first, we only generated nightly updates for en-US and all locales on mozilla-central. Shortly after the new 1.9.2 branch was created, we started generating and serving updates for it as well. At that point, we had to pause and take a good hard look at how long our single update VM was taking to generate all the partial updates we expected of it.
Instead of only 9 partial updates per day (3 active code branches x 3 platforms, en-US only), we were now expecting the system to generate more than 400 partial updates per day (2 code branches with l10n nightly updates enabled x 3 platforms x ~70 locales). Each update takes between 1 and 2 minutes to generate, so in the worst case it could take up to 14 hours to create all the partial updates. If we had added another branch at that point, we might well have gotten into a cycle where we never caught up with pending updates!
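For the curious, here is that arithmetic as a tiny Python sketch; the figures are the ones quoted above, not measurements:

```python
# Back-of-the-envelope version of the arithmetic above.
branches = 2            # code branches with l10n nightly updates enabled
platforms = 3           # Linux, Mac, Windows
locales = 70            # approximate locales per branch/platform
minutes_per_update = 2  # worst-case time to generate one partial update

updates_per_day = branches * platforms * locales                # 420
worst_case_hours = updates_per_day * minutes_per_update / 60.0  # 14.0

print("%d partial updates/day, up to %.0f hours to generate them all"
      % (updates_per_day, worst_case_hours))
```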
The nightly update generation script/cronjob had a few flaws that quickly became evident under the new load.
First, the script processed all pending updates in a single pass when kicked off via cron: all the new builds created since the last run were synced to the VM, and then update generation commenced. Any new builds that arrived while the current iteration of the script was running had to wait until the entire batch finished and the crontab entry fired again.
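A minimal sketch of that batch behaviour, with entirely hypothetical function names (the real script is more involved), might look like this:

```python
import time

def sync_new_builds_since_last_run():
    """Stub: in production this pulls new builds/MARs over to the VM."""
    return ["build-%d" % i for i in range(5)]

def generate_partial_update(build):
    """Stub: in production each partial update takes 1-2 minutes."""
    time.sleep(0)  # stand-in for the expensive MAR generation

def cron_iteration():
    # The pending set is snapshotted once, up front. Any build that
    # finishes after this point must wait for the next crontab firing,
    # no matter how long the current batch takes.
    pending = sync_new_builds_since_last_run()
    for build in pending:
        generate_partial_update(build)

if __name__ == "__main__":
    cron_iteration()
```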
Second, there was no prioritization of builds within the update generation script. Ordering was roughly first-in-first-out (FIFO), but it also depended on the vagaries of the internal hash's iteration order. This meant that en-US updates were no longer generated immediately, but whenever they happened to surface in the queue. This turned out to be especially bad for Windows, our most popular platform.
Because profile-guided optimization (PGO) lengthens the build cycle, nightly builds on Windows often finish after 7am PDT. Unfortunately, all our l10n nightly repacks were scheduled to start at exactly 7am PDT. This caused a few problems. Sometimes l10n nightly repacks would repack the previous day's Windows build, since the current nightly was still in progress. New l10n builds from other platforms would also often enter the nightly update generation queue before the new en-US builds for Windows were ready. Windows nightly updates sometimes did not get processed until late in the afternoon! This was a serious regression for nightly testers.
A couple of quick fixes were available. Armen added some code to the nightly update generation script to sort the pending update graph so that any pending en-US updates in the current batch would always be processed first. I also changed the way l10n nightly builds were being created so that they are triggered as soon as the corresponding base en-US nightly is finished building. This allowed builds to enter the nightly update generation queue sooner (well, on Linux and Mac at least), but also fixed the longstanding problem on Windows of sometimes creating l10n repacks using the build from the previous day. Together, these changes sped up the entire nightly update generation process by about 2 hours in aggregate. They also allowed us to get nightly updates for en-US on Windows out before noon PDT. This is acceptable in the interim, but not ideal.
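As a rough illustration of the prioritization fix (this is not Armen's actual code; I'm assuming each pending update can be keyed by its locale):

```python
def prioritize(pending_updates):
    """Move any en-US updates to the front of the pending batch.

    Assumes each entry is a (locale, platform) pair; the real script
    sorts its internal update graph rather than a flat list.
    """
    return sorted(pending_updates, key=lambda update: update[0] != "en-US")

batch = [("de", "win32"), ("en-US", "win32"), ("fr", "macosx")]
print(prioritize(batch))
# [('en-US', 'win32'), ('de', 'win32'), ('fr', 'macosx')]
```

Since sorted() is stable, the original FIFO ordering is preserved within each priority class, so non-en-US locales still get processed in arrival order.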
The timing is fortuitous, though, since I had already planned to tackle other aspects of the nightly update generation system over the next few weeks. It is a natural extension of that work to change nightly update generation from a cronjob running on a single VM to a build step that runs on the build slave at the end of each nightly build. We already have this parallelism set up, so why not use it?
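To make that concrete, here is a hypothetical buildbot-flavoured sketch; the step name and script path are invented, and the real factories are considerably more complicated:

```python
from buildbot.process.factory import BuildFactory
from buildbot.steps.shell import ShellCommand

factory = BuildFactory()
# ...the usual compile/package/upload steps for the nightly go here...

# Generate the partial update on the slave that just produced the build,
# instead of handing the work off to a single shared cron-driven VM.
factory.addStep(ShellCommand(
    name="generate_partial_update",
    description=["generating", "partial", "update"],
    # hypothetical script that creates the partial MAR for this build
    command=["python", "tools/update-packaging/make_partial_update.py"],
    haltOnFailure=False,  # a failed update shouldn't fail the whole build
))
```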
For any nightly testers out there, hopefully you can bear with us for a little while. We’ve greatly swelled your (potential) numbers by opening up nightly updates to the l10n community, but it will be a few more weeks until things are back to normal.