Wait times: why measuring machine efficiency is hard, and why we still need to try

Posted 10 months, 3 weeks ago at 11:46. 8 comments

Greg Szorc has been tireless in pushing for improvements in the build system. This past summer, he added automatic psutil system usage reporting for all Mozilla automation jobs run by mozharness. Since release engineering is actively moving all jobs to mozharness, we should soon have efficiency metrics for all jobs.
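
For a sense of what that reporting captures, here is a minimal sketch of psutil-based sampling. It is illustrative only: the real mozharness code differs, and the function name and sampling interval here are assumptions.

    # Illustrative sketch, not the actual mozharness resource monitor.
    # psutil is the real library used; everything else here is invented.
    import time
    import psutil

    def sample_system_usage(duration_seconds=30, interval=5):
        """Record CPU and memory utilization at a fixed interval."""
        samples = []
        end = time.time() + duration_seconds
        while time.time() < end:
            samples.append({
                # cpu_percent(interval=N) blocks for N seconds while measuring
                "cpu_percent": psutil.cpu_percent(interval=interval),
                "mem_percent": psutil.virtual_memory().percent,
            })
        return samples

    if __name__ == "__main__":
        usage = sample_system_usage()
        avg_cpu = sum(s["cpu_percent"] for s in usage) / len(usage)
        print("average CPU utilization: %.1f%%" % avg_cpu)

A job that averages, say, 25% CPU for most of its run is exactly the kind of signal this reporting surfaces.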

Unfortunately, Greg tried to parlay that work into a larger analysis of overall infrastructure machine efficiency. His analysis is not wrong, per se, but his post presents the state of machine efficiency with the assumption that it is that way by accident. Greg did post an addendum at the bottom of the entry, but I don’t think that edit ever found the same traction that the original piece did. I’d like to try to address why our machine efficiency numbers look the way they do.

First, let’s start off with some definitions and clarifications so we’re all talking about the same things.

There are two types of automation jobs we care about: builds and tests. Builds can be subdivided into try builds and regular builds, i.e. builds triggered by check-ins to non-try repositories covered by continuous integration.

Both types of build jobs spawn corresponding test jobs. In theory, people who submit jobs to the try server can limit the types of test jobs spawned by their build jobs. Few people do. I’ll try to be clear when I’m speaking about build or test jobs specifically, but in general, we care about the aggregate of build+test jobs more than each type individually for reasons discussed below.

There are a bunch of metrics we care about, regardless of job type:

  • job efficiency: This is what Greg is measuring with psutil. It boils down to how well we are using system resources during any particular job, as measured by CPU, memory usage, etc.
  • machine efficiency: how well we are using system resources over a given time frame, e.g. per day. IT cares about this a great deal, especially for in-house systems (most test machines), so we’ve recently been deploying collectd everywhere we can.
  • wait times: the time between a developer submitting a patch and when a build based on that patch starts compiling
  • end-to-end time: how long from when a developer lands a patch until the final test result spawned by the resulting builds finishes reporting

Of the above metrics, wait times are the most important to Mozilla.

Release engineering has made the following commitment to developers:

95% of jobs will start within 15 minutes of submission

This commitment applies for *both* build and test jobs, i.e. a build job must start within 15 minutes of a patch landing, and then when that build has finished building, all the test jobs spawned by that build must start within 15 minutes of the end of the build. Wait time statistics are collected daily and mailed to the dev.tree-management newsgroup.
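
To make the commitment measurable, here is a hedged sketch of the arithmetic behind those daily reports. It is not the actual tooling that posts to dev.tree-management; the data format is assumed.

    # Illustrative sketch only; the real wait time reports are generated by
    # separate tooling. `jobs` is assumed to be a list of
    # (submit_time, start_time) pairs in epoch seconds.
    def wait_time_report(jobs, threshold_minutes=15, target=0.95):
        waits = [(start - submit) / 60.0 for submit, start in jobs]
        within = sum(1 for w in waits if w <= threshold_minutes)
        fraction = within / float(len(waits)) if waits else 1.0
        return fraction, fraction >= target

    # Example: waits of 5, 10, 14, and 20 minutes -> 75% within 15 minutes,
    # so the 95% commitment is missed for this (tiny) sample.
    example_jobs = [(0, 300), (0, 600), (0, 840), (0, 1200)]
    print(wait_time_report(example_jobs))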

How are we doing with meeting that commitment?

We continue to meet the 95% commitment for build jobs, both try and regular. Because of the nature of test jobs, with many tests spawning from a single build, our test infrastructure is *not* currently keeping up. Over the past year, efforts by the A-Team, IT, sheriffs, and release engineering to improve reliability and manageability have made things better, though.

How much better?

9 months ago (and earlier), we had wild variability in wait times for test jobs, bouncing between 50% and 70% of jobs starting within 15 minutes, despite having only a fraction of the “normal” load we see now. Our peak load at the time was 3,700 build jobs/day and 44,000 test jobs/day, with the average non-weekend numbers hovering around 3,000 build jobs/day and 35,000 test jobs/day.

At present, we regularly hit 85% of test jobs starting within 15 minutes, despite dealing with many more total jobs per day. Our new “normal” load is 5,000 build jobs/day and 50,000 test jobs/day. John O’Duinn recently blogged about the dramatic increase in job traffic we’ve experienced over the past year, including regular new high watermarks for both build and test jobs.

Why are we still failing to meet our test capacity commitment?

There are a few stand-out reasons:

  • Simple volume: As John notes, we handle many more jobs per day than we did a year ago.
  • Distribution of jobs: We currently run 254 compute hours worth of jobs per check-in, but those check-ins don’t happen at regular intervals. Developers are clustered in certain timezones, and those timezones have standard working hours. Few people land code at midnight PDT on Saturday, but it turns out almost everybody lands code between 2-3pm PDT on Tuesday (see the back-of-envelope sketch after this list).
  • Older platform support: We support older platforms based on OS market share data. The pool of available hardware for older platforms is often limited. This situation is especially acute on Mac. 10.6 (Snow Leopard) is still a substantial chunk of our Mac user base, but because of Apple’s aggressive policy of de-supporting old hardware, we are essentially stuck with the existing test infrastructure we have for older Mac platforms. Even on platforms where market share numbers are smaller, e.g. 10.7 (Lion), it is a tough sell to turn off or reduce testing where it already exists.
  • We keep adding new tests: It’s not an everyday event (because it’s not easy, certainly), but the number of tests we’re running increases steadily over time.
  • We never turn off old tests: Maybe it’s the pain of getting new tests past the bar of initial acceptance, but in my 8 years at Mozilla, I’ve never seen us de-prioritize, much less turn off, a test suite that is already running in production. Coupled with adding new tests, this means that the amount of testing we do never goes down, only up.
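
To see why the clustering of check-ins hurts so much, here is the back-of-envelope sketch mentioned above. Only the 254 compute-hours-per-check-in figure comes from this post; the check-in rates are invented for illustration, and real scheduling is far messier (pools are per-platform, job lengths vary, etc.).

    # Back-of-envelope only: the check-in rates below are invented; only the
    # 254 compute hours per check-in figure comes from the post.
    COMPUTE_HOURS_PER_CHECKIN = 254

    def machines_needed(checkins_per_hour):
        """Machines that must be busy to absorb an hour's check-ins within
        that same hour, i.e. without queueing."""
        return int(round(checkins_per_hour * COMPUTE_HOURS_PER_CHECKIN))

    print("quiet hour (1 check-in): ", machines_needed(1))    # ~254 machines
    print("peak hour (20 check-ins):", machines_needed(20))   # ~5080 machines

    # Sizing pools for the peak keeps wait times low but leaves most of that
    # hardware idle off-peak, which is exactly the low average machine
    # utilization Greg measured.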

To return to Greg’s post that started this discussion, Mozilla release engineering is actively optimizing for burst capacity and developer turnaround time rather than for machine utilization or some other metric.

Why did we choose to optimize for wait times versus other metrics?

It’s simple, really. It’s a recognition that other factors are out of our (release engineering’s) control: builds are recursive and poorly optimized, and there are oh-so-many tests. You can’t finish what you don’t start, so despite the length of individual builds and the sheer volume of tests, we commit to starting all jobs within 15 minutes because it gives developers the feedback they need as quickly as possible.

Is this the only way to think about efficiency? No, but it’s the one that we’ve agreed upon over many years of discussion between developers, contributors, sheriffs, and management. We could conceivably have 100% machine utilization with smaller pools of machines if developers were willing to wait longer for results. That is not currently the expectation of contributors, and I reckon it would take a fair amount of lobbying to change that expectation.

“You can expect what you inspect.” – W. Edwards Deming

Nobody is resting on their laurels here, mind you. I will say that I am encouraged by the recent fervor surrounding efficiency. More efficient jobs have a cascade effect on wait times and end-to-end times: each job completes more quickly and allows us to move on to the next job. Small improvements add up, especially when multiplied across thousands of build jobs, or tens of thousands of test jobs, per day.

Greg’s work to maximize CPU usage in the build process, glandium’s work to remove recursion in the build system, and investigation into alternate build systems like Tup are all promising signs on the build efficiency front. If we can also fix some known long-pole bad actors like Windows PGO builds, or two-pass Mac universal builds, we’ll be able to spawn tests from builds much sooner.

In terms of test efficiency improvements, release engineering and the A-Team continue to make adjustments to running test suites. We look at the overall time it takes to run a given test suite, weigh that against the setup and teardown time for that suite, and then decide whether to subdivide that suite into multiple parts. In addition, the A-Team continues to work at parallelizing existing test suites. Some suites like XPCshell are already done, while others are upcoming. As we expand our testing with emulators, we are investigating running multiple emulators simultaneously on a single physical machine to increase throughput.
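
As a rough illustration of that chunking trade-off (the numbers and function here are invented, not the actual scheduling logic):

    # Illustrative sketch of the suite-splitting trade-off: more chunks mean
    # faster wall-clock results, but setup/teardown overhead is paid per chunk.
    # Numbers are invented for illustration.
    def chunking_cost(total_runtime_min, overhead_min, chunks):
        wall_clock = total_runtime_min / chunks + overhead_min    # time to results
        machine_time = total_runtime_min + overhead_min * chunks  # capacity consumed
        return wall_clock, machine_time

    total, overhead = 120.0, 10.0  # a 2-hour suite with 10 minutes of overhead
    for chunks in (1, 2, 4, 8):
        wall, machine = chunking_cost(total, overhead, chunks)
        print("%d chunk(s): %5.1f min to results, %5.1f machine-minutes" %
              (chunks, wall, machine))

    # Whether to split further depends on which matters more at the moment:
    # developer turnaround (wall clock) or total machine time (capacity).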

The current set of expectations around machine utilization and wait times has been informed by many years of conversation between interested parties. Want to start a new conversation? Comment, write me, ping me in IRC, or pull me (or your local release engineer) aside at the Summit this week to discuss.


8 Replies

  1. “It’s a recognition that other factors are out of our (release engineering’s) control: builds are recursive and poorly optimized, and there are oh-so-many tests.”

    What prevents RelEng from fixing low hanging performance problems in tests? You seem to have the overall view of the problem, focus and expertise.

  2. “It’s a recognition that other factors are out of our (release engineering’s) control: builds are recursive and poorly optimized, and there are oh-so-many tests.”

    There’s a factor that’s under releng’s control and that I don’t see addressed in this post, and that sadly hasn’t gotten traction in the past: the wait time for *actual* builds to start. It’s nice that we’re committed to 15 minutes for a build to start on a slave, but how good is it if it takes another 20 minutes for the actual build to start (where by actual build, I mean make -f client.mk being started).

  3. And wrote:
    > “What prevents RelEng from fixing low hanging performance problems in tests?”

    Such as? I feel like you’re wanting to call out a specific bug number here.

    If we’re missing something easy, please let us know. Both the A-Team and releng are looking at these things, but admittedly we’re so close to the problems that it’s entirely possible we overlook things.

  4. glandium wrote:
    > There’s a factor that’s under releng’s control and that I don’t see addressed in this post, and that sadly hasn’t gotten traction in the past: the wait time for *actual* builds to start. It’s nice that we’re committed to 15 minutes for a build to start on a slave, but how good is it if it takes another 20 minutes for the actual build to start (where by actual build, I mean make -f client.mk being started).

    Absolutely agreed. Armen is taking on bug 712206 as a Q4 goal which will do away with things like unnecessary tools repo deletion and re-cloning, clobbering, etc. This will allow slaves to return to their respective pools ready to do work, rather than spending 20 minutes getting ready to do work.

  5. Some ideas that might help with capacity:

    * Partial Try modes. One that picks a single platform for all builds and tests, and another that builds on all platforms but picks a platform for each test. https://bugzilla.mozilla.org/show_bug.cgi?id=674751

    * Automatically group, test, and land non-conflicting patches from Try. This would reduce the need for full Try runs: patches would get partial Try tests on their own and full Try tests just before they land. https://groups.google.com/d/msg/mozilla.dev.platform/9f3IVMp0KDs/SQj_Kc4gQbMJ

    * TBPL-integrated bisection, so we can run mostly-redundant tests (e.g. OS X 10.6) less often. Give sheriffs an emergency backfill button (run up to 24h of skipped jobs on a given tree) for when they can’t wait for bisection.

  6. > “I feel like you’re wanting to call out a specific bug number here.”

    No bug number, sorry. I think I was referring to the “poorly optimized” part of what I quoted, and to the fact that you (as opposed to the people that have written the code and tests over time) have an aggregate view so you can focus on the most poorly optimized parts.

  7. Jesse: in bug 793989, we have asked for the ability to schedule jobs on a given build that have never been scheduled before. Given that ability we could implement the “don’t run 100% of tests on all platforms for all pushes” mentality. It would give us the ability to automatically bisect and find a failure if a test fails after not being run for XX hours.

  8. And wrote:
    > No bug number, sorry. I think I was referring to the “poorly optimized” part of what I quoted, and to the fact that you (as opposed to the people that have written the code and tests over time) have an aggregate view so you can focus on the most poorly optimized parts.

    Releng is mostly content-agnostic when it comes to builds and tests. The build system and test harnesses are written and maintained by other groups; we merely provide the automation framework to run them.

    It’s *possible* we have people on the releng staff who have the skills to tackle those problems directly, but it’s about the most effective use of leverage. Better to engage the A-Team or other contributors who have more experience with the particular frameworks and content of the tests themselves.

    Now, there are parts of the automation itself that could use improvement, as glandium and Jesse have suggested. Releng is quite able and willing to tackle those problems, while also lobbying for improvements to the tests themselves.

