Summit takeaways for release engineering

Posted 10 months, 3 weeks ago at 20:26. 4 comments

It’s hard to boil down 3 days of intense discussions to a few bullet points, much less discussions had by 18 people over 3 Summit locations, but here I go.

First, did you know that Mozilla release engineering (releng) is now 18 full-time staff, plus up to 3 interns at any given time? It still boggles my mind, and I live it (and manage some of it) every day.

Specific takeaways from the Summit for releng:

  • Respect: John O’Duinn had a great blog post right before the Summit on this topic, and the timing was not coincidental. As we all go our separate ways after the Summit, think about how we can all continue to engender that feeling of community, and how we can all continue to assume positive intent.
  • Response times: I ended up in many sessions with David Eaves in Toronto. He was very focused on response times as the *most* important metric concerning continuing community contribution: if you can respond to incoming contributions within hours, there is a very high likelihood that that particular contributor will contribute again, but if that response time trends into days or weeks, the chance of follow-up contribution is _minimal_. This is something that needs to be baked into our tools at a fundamental level. I’m looking at you, Bugzilla. New contributors will contribute where there is the best chance of feedback and success. You incentivize all the right things for Mozilla in this model.
  • Scale: this has strong ties to the previous item about response times, but what happens when releng systems need to contend with 1000x, or even 5x, the current code contribution load? The current system is already bursting at the seams under today’s load. How do we design for many more technical contributors? There are some essential design decisions we need to make (I don’t think buildbot gets us where we need to be), but it boils down to: automate all the things! All the current hands-on activities for developers, and especially sheriffs, need to happen automatically. This includes landing, back-out, and bisection.
  • Remoties: members of releng gave different versions of this talk at all 3 Summit locations. Attendance at all 3 of these sessions was skewed heavily towards people who already work remotely versus in an office, so in many ways it was like preaching to the choir. I know in Toronto we all seemed to have similar coping strategies for the problems of working remotely. How best to engage office workers who don’t even recognize this as an issue? One idea I came up with post-Summit was to send local workers to a remote office for a week sometime within their first 3 months of work. Releng is reasonably good about bringing new remote employees into an office location when they start, but I think sending office-based workers to a remote location for a week, even if it is another office, would give them a better appreciation of the difficulties inherent in working with their peers from somewhere else.
  • Rust/Servo: lots of great stuff going on in terms of Mozilla’s next-gen browser. Our main takeaway here is that there are a bunch of interesting, custom continuous integration pieces used by this project that have no current analogue elsewhere. It will be releng’s challenge to adapt and/or support these systems as they become more mainstream.
  • Live Build/Test logs: this has been a long-time ask for both sheriffs and interested developers. Historically, it’s been quite expensive for us to expose this, though not impossible. More work and investment in tooling may help us provide this service in the future.
  • git vs. hg: Mercurial (hg) remains the version control system of record for Mozilla products like Firefox, but there is no denying the critical mass built up around git, and more specifically, GitHub. In fact, when most people advocate for git, it has been releng’s experience that they are really advocating for the new collaboration model enabled by GitHub. It’s an important distinction to make as we make choices going forward. We currently try to mirror everything from hg into equivalent git repos, but I don’t know how long a dual-VCS model will remain tenable as we scale.
  • Release Management vs. Release Engineering: a lot of people we talked to had trouble making a distinction between the two groups. No doubt this is made harder by the transition of Lukas Blakk from a role in releng to a position as a Release Manager. Historically speaking, there was a lot of overlap between the two groups. Since the split, I can best sum up the division of responsibilities as such: Release Management deals with the decisions around code quality, patch uplift and acceptance, and the health of any given release branch (specifically beta and release) when it comes to building new versions of our products, whereas Release Engineering provides the plumbing to make those decisions possible – continuous integration, release, and update infrastructure at whatever scale is required.
  • QA testing on releng infra: Apparently QA sets up and runs a lot of their own infrastructure for testing. As much as possible, we should enable their testing to happen on releng infrastructure. We already have many of the scaling issues figured out.
  • AWS: met a lot of great people involved in Mozilla’s cloud services initiatives, many of whom are in those roles because of past experience with other providers like Amazon. We received lots of great feedback about how to minimize costs by balancing instance types and reserved instances. These tips and rules of thumb will prove invaluable as we continue to leverage cloud services going forward. Releng will be sure to publish our accumulated knowledge here as we go. This seems like really important insight to give back to the community.
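
To make the "automate all the things" point under Scale a little more concrete, here is a minimal sketch of what automated bisection could look like. It is not how any existing releng tool works: the function name `find_first_bad`, the `is_good` callback, and the revision labels are all hypothetical, and the sketch assumes the pushes are ordered oldest to newest, the oldest is known good, the newest is known bad, and everything after the first bad push stays bad.

```python
from typing import Callable, Sequence


def find_first_bad(revisions: Sequence[str],
                   is_good: Callable[[str], bool]) -> str:
    """Binary-search an ordered range of revisions for the first bad one.

    Assumptions (hypothetical, for illustration only):
      * revisions[0] is known good and revisions[-1] is known bad;
      * once a revision is bad, every later revision is bad as well;
      * is_good(rev) builds and tests rev and returns True on success.
    """
    if len(revisions) < 2:
        raise ValueError("need at least one good and one bad revision")

    lo, hi = 0, len(revisions) - 1  # lo is known good, hi is known bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_good(revisions[mid]):
            lo = mid  # the breakage landed after mid
        else:
            hi = mid  # the breakage landed at or before mid
    return revisions[hi]


# Toy usage: 16 made-up pushes, with the breakage introduced at "rev11".
pushes = [f"rev{i}" for i in range(16)]
tested = []


def run_tests(rev: str) -> bool:
    """Stand-in for a real build-and-test run; records what was tested."""
    tested.append(rev)
    return int(rev[3:]) < 11


culprit = find_first_bad(pushes, run_tests)
print(culprit, "found after", len(tested), "test runs")
```

In this toy run the culprit is found after 4 test runs rather than 14, which is the whole appeal: the number of builds grows with the logarithm of the push range, so a machine can do it without a sheriff babysitting each step.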

There are undoubtedly things I’ve left out. There were lots of individual experiences that don’t roll up as well, but were just as important. For one, I thoroughly enjoyed playing in the first ever Summit series and I’d love to see more of that at future Summits, regardless of the sport involved.

I’m curious to know what priorities other groups took away from the Summit.

Current Tunes: deadmau5 - October | Filed under Build/Release, Firefox, Mozilla, Open Web |

4 Replies

  1. Dustin J. Mitchell Oct 23rd 2013

    Buildbot-0.8.2 is certainly showing its age, but newer versions are better, and 0.9.0 promises to scale to the levels releng will need. It will also enable live logs, true clustering, DB-backed status, and a slew of other useful features. Any support Mozilla can provide to that effort will only help it be ready sooner!

  2. We’re planning to do a buildbot sprint during our teamweek in November to see how much work would be involved in upgrading.
