Greg Szorc has been tireless in pushing for improvements in the build system. This past summer, he added automatic psutil system usage reporting for all Mozilla automation jobs run by mozharness. Since release engineering is actively moving all jobs to mozharness, we should soon have efficiency metrics for all jobs.
Unfortunately, Greg tried to parlay that work into a larger analysis of overall infrastructure machine efficiency. His analysis is not wrong, per se, but his post presents the state of machine efficiency with the assumption that it is that way by accident. Greg did post an addendum at the bottom of the entry, but I don’t think that edit ever found the same traction that the original piece did. I’d like to try to address why our machine efficiency numbers look the way they do.
First, let’s start off with some definitions and clarifications so we’re all talking about the same things.
There are two types of automation jobs we care about: builds and tests. Builds can be subdivided into try builds and regular builds, i.e. builds triggered by check-ins to non-try repositories covered by continuous integration.
Both types of build jobs spawn corresponding test jobs. In theory, people who submit jobs to the try server can limit the types of test jobs spawned by their build jobs. Few people do. I’ll try to be clear when I’m speaking about build or test jobs specifically, but in general, we care about the aggregate of build+test jobs more than each type individually for reasons discussed below.
There are a bunch of metrics we care about, regardless of job type:
- job efficiency: This is what Greg is measuring with psutil. It boils down to how well are we using the system resources during any particular job, as measured by CPU, memory usage, etc.
- machine efficiency: how well are we using system resources over a given time frame, e.g. per day. IT cares about this a great deal, especially for in-house systems (most test machines), so we’ve recently been deploying collectd everywhere we can.
- wait times: the time between a developer submitting a patch and when a build based on that patch starts compiling
- end-to-end time: how long from when a developer lands a patch until the final test result spawned by those builds finishes reporting
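To make the distinction between these metrics concrete, here is a minimal sketch of how machine efficiency might be computed from per-job start/end times on a single machine. The function and the sample data are illustrative, not taken from any Mozilla tooling:

```python
# Hedged sketch: daily machine utilization (time spent running jobs divided
# by wall-clock time) for one machine. All numbers are made up.
def machine_utilization(job_intervals, day_seconds=86400):
    """Fraction of the day the machine spent running jobs.

    job_intervals: list of (start, end) offsets in seconds from midnight,
    assumed non-overlapping (one job at a time per machine).
    """
    busy = sum(end - start for start, end in job_intervals)
    return busy / day_seconds

# A machine that ran three two-hour jobs is 25% utilized for the day.
jobs = [(0, 7200), (30000, 37200), (60000, 67200)]
print(machine_utilization(jobs))  # 0.25
```

Note that a machine can score well here while individual jobs waste CPU (poor job efficiency), and vice versa; the two metrics answer different questions.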
Of the above metrics, wait times are the most important to Mozilla.
Release engineering has made the following commitment to developers:
95% of jobs will start within 15 minutes of submission
This commitment applies for *both* build and test jobs, i.e. a build job must start within 15 minutes of a patch landing, and then when that build has finished building, all the test jobs spawned by that build must start within 15 minutes of the end of the build. Wait time statistics are collected daily and mailed to the dev.tree-management newsgroup.
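The commitment check itself is simple arithmetic. Here is a hedged sketch of the calculation, not the actual reporting code that mails dev.tree-management:

```python
# Illustrative check of the wait-time commitment: did at least 95% of jobs
# start within 15 minutes of submission? Function name and data are made up.
def meets_wait_time_commitment(wait_times_seconds, threshold=15 * 60, target=0.95):
    """True if at least `target` fraction of jobs started within `threshold`."""
    if not wait_times_seconds:
        return True
    on_time = sum(1 for w in wait_times_seconds if w <= threshold)
    return on_time / len(wait_times_seconds) >= target

# 19 of 20 jobs (95%) started within 15 minutes: commitment met.
waits = [60] * 19 + [1800]
print(meets_wait_time_commitment(waits))  # True

# 18 of 20 jobs (90%): commitment missed.
print(meets_wait_time_commitment([60] * 18 + [1800, 1800]))  # False
```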
How are we doing with meeting that commitment?
We continue to meet the 95% commitment for build jobs, both try and regular. Because of the nature of test jobs, with many tests spawning from a single build, our test infrastructure is *not* currently keeping up. Over the past year, efforts by the A-Team, IT, sheriffs, and release engineering to improve reliability and manageability have made things better, though.
How much better?
9 months ago (and earlier), we had wild variability in wait times for test jobs, bouncing between 50% and 70% of jobs starting within 15 minutes, despite having only a fraction of the “normal” load we see now. Our peak load at the time was 3,700 build jobs/day and 44,000 test jobs/day, with the average non-weekend numbers hovering around 3,000 build jobs/day and 35,000 test jobs/day.
At present, we regularly hit 85% of test jobs starting within 15 minutes, despite dealing with many more total jobs per day. Our new “normal” load is 5,000 build jobs/day and 50,000 test jobs/day. John O’Duinn recently blogged about the dramatic increase in job traffic we’ve experienced over the past year, including regular new high watermarks for both build and test jobs.
Why are we still failing to meet our wait time commitment for test jobs?
There are a few stand-out reasons:
- Simple volume: As John notes, we handle many more jobs per day than we did a year ago.
- Distribution of jobs: We currently run 254 compute hours’ worth of jobs per check-in, but those check-ins don’t happen at regular intervals. Developers are clustered in certain timezones, and those timezones have standard working hours. Few people land code at midnight PDT on Saturday, but it turns out almost everybody lands code between 2-3pm PDT on Tuesday.
- Older platform support: We support older platforms based on OS market share data. The pool of available hardware for older platforms is often limited. This situation is especially acute on Mac. 10.6 (Snow Leopard) is still a substantial chunk of our Mac user base, but because of Apple’s aggressive policy of de-supporting old hardware, we are essentially stuck with the existing test infrastructure we have for older Mac platforms. Even on platforms where market share numbers are smaller, e.g. 10.7 (Lion), it is a tough sell to turn off or reduce testing where it already exists.
- We keep adding new tests: It’s not an everyday event (adding a new test suite certainly isn’t easy), but the number of tests we’re running increases steadily over time.
- We never turn off old tests: Maybe it’s the pain of getting new tests past the bar of initial acceptance, but in my 8 years at Mozilla, I’ve never seen us turn off, much less de-prioritize, a test suite that is already running in production. Coupled with adding new tests, this means that the amount of testing we do never goes down, only up.
To return to Greg’s post that started this discussion, Mozilla release engineering is actively optimizing for burst capacity and developer turnaround time rather than for machine utilization or some other metric.
Why did we choose to optimize for wait times versus other metrics?
It’s simple, really. It’s a recognition that other factors are out of our (release engineering’s) control: builds are recursive and poorly optimized, and there are oh-so-many tests. You can’t finish what you don’t start, so despite the length of individual builds and the sheer volume of tests, we commit to starting all jobs within 15 minutes because it gives developers the feedback they need as quickly as possible.
Is this the only way to think about efficiency? No, but it’s the one that we’ve agreed upon over many years of discussion between developers, contributors, sheriffs, and management. We could conceivably have 100% machine utilization with smaller pools of machines if developers were willing to wait longer for results. That is not currently the expectation of contributors, and I reckon it would take a fair amount of lobbying to change that expectation.
“You can expect what you inspect.” – W. Edwards Deming
Nobody is resting on their laurels here, mind you. I will say that I am encouraged by the recent fervor surrounding efficiency. More efficient jobs have a cascade effect on wait times through end-to-end times: each job completes more quickly and allows us to move on to the next job. Small improvements add up, especially when multiplied across thousands of build jobs, or tens of thousands of test jobs, per day.
Greg’s work to maximize CPU usage in the build process, glandium’s work to remove recursion in the build system, and investigation into alternate build systems like Tup are all promising signs on the build efficiency front. If we can also fix some known long-pole bad actors like Windows PGO builds, or two-pass Mac universal builds, we’ll be able to spawn tests from builds much sooner.
In terms of test efficiency improvements, release engineering and A-Team continue to make adjustments to running test suites. We look at the overall time it takes to run a given testsuite, weigh that against the setup and teardown time for that suite, and then decide whether to subdivide that suite into multiple parts. In addition, the A-Team continues to work at parallelizing existing test suites. Some suites like XPCshell are already done, while others are upcoming. As we expand our testing with emulators, we are investigating running multiple emulators simultaneously on a single physical machine to increase throughput.
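The chunking trade-off described above can be sketched with a simple cost model: splitting a suite into more parallel chunks shortens the longest chunk, but each chunk pays its own setup and teardown overhead. The function, the cap, and the numbers below are all illustrative assumptions, not how release engineering actually decides:

```python
# Hedged sketch of the chunking decision: find the smallest number of chunks
# whose wall time meets a target, under a naive model where a suite splits
# evenly and each chunk adds fixed setup/teardown overhead. All numbers are
# illustrative; the real decision weighs more factors.
def smallest_chunk_count(total_runtime, overhead, target_wall_time, max_chunks=20):
    """Smallest chunk count meeting the wall-time target, or None if impossible.

    Wall time for n chunks = total_runtime / n + overhead (minutes).
    Picking the *smallest* qualifying n keeps total machine time
    (total_runtime + n * overhead) from ballooning.
    """
    for n in range(1, max_chunks + 1):
        if total_runtime / n + overhead <= target_wall_time:
            return n
    return None

# A 60-minute suite with 5 minutes of per-chunk setup/teardown:
# 4 chunks gives 60/4 + 5 = 20 minutes of wall time.
print(smallest_chunk_count(60, 5, 20))  # 4
```

Preferring the smallest qualifying chunk count is one way to express the weighing of run time against setup/teardown cost: every extra chunk buys less wall time while consuming a full extra overhead’s worth of machine time.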
The current set of expectations around machine utilization and wait times has been informed by many years of conversation between interested parties. Want to start a new conversation? Comment, write me, ping me in IRC, or pull me (or your local release engineer) aside at the Summit this week to discuss.