Managing build team expectations: tinderbox support from IT

Posted 1 year, 8 months ago at 17:26. 2 comments

As of today, the Mozilla build team is going to be handing off all front-line tinderbox support to the Mozilla IT team.

The Why:

The Tinderbox build farm is still the core of our developer infrastructure, providing developers with build access to platforms they might not otherwise have. Sadly, we don’t always pay sufficient attention to it.

The build team is stretched pretty thin dealing with a release cycle that seems at times perpetual. We have the core releases on two branches, a multitude of partner builds on both branches, preview releases on Trunk, and updates to generate for all of the above. Day-to-day tinderbox maintenance frequently falls through the cracks. Unless a failed tinderbox is in the critical path for that week, it often goes unfixed. In the past, squeaky wheels in #build would sometimes get the grease. This may have created the false expectation that the build team was monitoring these builds in the evenings and over the weekends. This was never the case, and when we kicked boxes at those times, it was because we happened to be around. That mode of operation tends to fail outside of regular business hours, and is not very sustainable staffing-wise to begin with.

Enter IT. IT is staffed by smart, technical, sysadmin-types who are trained to keep servers up-and-running. Perhaps more importantly, IT already has an established on-call rotation for dealing with important issues quickly outside of the standard workday.

The How:

Over the last few months, the build team has worked with IT to implement nagios monitoring for all our key tinderbox systems. We have divided up the build farm into a handful of service tiers. We are no longer reliant on helpful hints in #build that key tinderboxen have gone down. Tier 1 tinderboxen will receive immediate attention from the IT staff member on-call if they stop building, be it in the middle of the workday or the middle of the night. The service levels will look as follows:

Tier 1: 24-hour, on-call support
Tier 2: Support during business hours (Monday - Friday, 9 am to 6 pm PST/PDT, excluding US holidays)
Tier 3: As time permits (may be multiple days/weeks, depending on request/issue)

Of course, due to the Tinderbox server architecture, IT won’t be paged by the nagios monitoring until a build falls off the waterfall page. This typically takes 12 hours. There is still the need for vigilance here, especially if the tinderbox you happen to care about isn’t ranked as Tier 1 or isn’t monitored by nagios at all.
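
To make the tiers concrete, here is a rough Python sketch of how the response policy described above could be expressed. It is only an illustration, not the actual nagios configuration: the function names, the returned strings, and the treatment of the 12-hour waterfall window as a hard threshold are all assumptions, and holidays are not modeled.

```python
from datetime import datetime, timedelta

# Assumption: a build "falls off the waterfall" after about 12 hours of silence.
STALE_AFTER = timedelta(hours=12)


def is_business_hours(now):
    """Monday-Friday, 9 am to 6 pm. `now` is assumed to already be Pacific local time.

    Holidays are deliberately ignored in this sketch.
    """
    return now.weekday() < 5 and 9 <= now.hour < 18


def response_for(tier, last_build, now):
    """Return a hypothetical description of what happens for a tinderbox.

    tier: 1, 2, 3, or None for a box that is not monitored at all.
    last_build: datetime of the last build seen on the waterfall.
    """
    if now - last_build < STALE_AFTER:
        return "ok"  # still on the waterfall, nothing is paged
    if tier == 1:
        return "page on-call"
    if tier == 2:
        return "handle now" if is_business_hours(now) else "wait for business hours"
    if tier == 3:
        return "as time permits"
    return "file a bug"  # unmonitored boxes rely on someone noticing


if __name__ == "__main__":
    saturday_night = datetime(2024, 1, 6, 3, 0)
    stale = saturday_night - timedelta(hours=13)
    for tier in (1, 2, 3, None):
        print(tier, response_for(tier, stale, saturday_night))
```

The point of the sketch is the asymmetry: only Tier 1 gets a response regardless of the clock, and nothing happens for any tier until the build has been silent long enough to drop off the waterfall.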

For these situations, and indeed for all other tinderbox inquiries going forward, filing bugs in Bugzilla against IT with the appropriate severity is the correct way to proceed. As I’ve mentioned before, IT is very good about managing their bug queue. Your request will be heard.

The build team is not washing its hands of tinderbox. We will be there to help IT with tinderbox as required, but we *will* be taking a step back in order to concentrate on other things: improving the release process, switching version control systems, and such.

It may take a little while to work out the kinks in this new system, but we’re confident that this arrangement will lead to a better tinderbox experience for everybody. We ask that you extend IT the same courtesy you would afford the build team as they learn the tinderbox ropes over the next few months…or you could be nice to them instead. ;-)

Current Tunes: Junior Jack and Kid Creme - Essential Mix - 2004/05/23 | Filed under Build/Release, Mozilla |

2 Replies

  1. Two points:

    Why isn’t argo-vm a tier 1 service like fx-win32-tbox and bm-xserve08 are?

    And who exactly is on the build team and who is on the IT team? I think I’ve been confusing the two at times…

    Keep up the good work on build-related matters :-)

  2. “Why isn’t argo-vm a tier 1 service like fx-win32-tbox and bm-xserve08 are?”

    Um, off-by-one error? ;)

    Fixed.