Posted 3 years, 1 month ago at 22:30.
For those who want to skip to my specific proposals — there are 6 — for reclaiming space on stage.mozilla.org, please skip ahead to “Redux”, but if you’re going to comment, please read the whole thing.
Every day we produce up to 17G worth of new nightly builds for Firefox across all branches. This includes all opt+debug builds in 75+ locales, each on 4 operating systems (OSes), each on 9 different project branches. We do reclaim much of this 17G as we retire l10n builds older than 1 week, but we are still creating 1.3G of new nightly en-US builds that we need to store (essentially) indefinitely. nthomas did some cleanup recently as part of bug 562261 and that has bought us some more time, but this inexorable increase will eventually overrun our disk capacity on the staging server. The problem is magnified once we add the disk usage of nightlies from other products, which may have worse nightly hygiene habits, and the expected nightly increase in space requirements as we add 4 new OSes (Linux 64bit, Windows 64bit, OSX10.6 64bit and Android).
To date, our solution to this problem has been to do periodic cleanups, usually under duress like bug 562261 (which can be error-prone), or to simply buy more disk. As Justin notes, while disk space may be “cheap”, it is not an infinite resource, either in terms of upfront or management cost. We need an actual policy to govern how long we keep nightlies. We can then use that policy as a baseline to frame further discussions about keeping things for longer periods in special cases.
The good news: there *is* space that can be reclaimed. There are two types of wins to be had here:
- One-time space recovery by deleting or archiving material that is no longer useful, or that can be kept offline and spun up as required.
- Codified policy changes for automatically expiring old content.
We can tackle #1 (one-time space recovery) for Firefox by:
- moving no-longer-supported releases, i.e. anything prior to 3.0 (including firebird) to a true archive. This will free up about 125G of space, and will have the added benefit of not housing unsupported builds next to supported builds, making them a little more difficult for people to stumble upon.
- deleting nightly builds older than a certain date. The Firefox nightly directories are conveniently broken down by year, so it makes it easy to see how much disk space we could reclaim by deleting old nightlies:
2010 216G (so far)
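For anyone who wants to reproduce per-year figures like these, here is a minimal sketch of how the totals could be gathered. It assumes only what is stated above, that the nightly directory has one subdirectory per year; the function name `usage_by_year` and the path in the usage note are made up for illustration and this is not the script actually used on stage.

```python
import os
from pathlib import Path


def usage_by_year(nightly_root):
    """Return {year: total_bytes} for each 4-digit year directory.

    Assumption: nightly_root contains one subdirectory per year
    (e.g. 2006/, 2007/); anything else in it is ignored.
    """
    totals = {}
    for entry in sorted(Path(nightly_root).iterdir()):
        if not (entry.is_dir() and entry.name.isdigit() and len(entry.name) == 4):
            continue
        total = 0
        for dirpath, _dirnames, filenames in os.walk(entry):
            for name in filenames:
                path = os.path.join(dirpath, name)
                # Skip symlinks so linked builds are not counted twice.
                if not os.path.islink(path):
                    total += os.path.getsize(path)
        totals[entry.name] = total
    return totals
```

Calling `usage_by_year("/path/to/firefox/nightly")` (a placeholder path) returns byte counts per year; divide by `1024 ** 3` to get figures in G.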
At the risk of making others’ arguments for them, both previous times we attempted to come up with a stage cleanup policy (2006 & 2008) the major concern raised was a need to keep builds in perpetuity to allow regression detection via binary search.
So I say this: I’d like to hear from developers who have actually had to perform binary searches of nightlies to let me know exactly how far back in time they have had to go. In the absence of any other data, I’m going to suggest we remove all nightlies prior to 2007 simply because that corresponds with the start of the hg era (March 2007).
Please note that when I say “remove” in the context of this discussion, I am advocating for “delete” but understand that I may need to settle for “archive” in whatever form makes sense to IT (slow disk/tape/???).
Other non-Firefox projects will need to make their own decisions as to how/when to archive older releases, but those projects could also reclaim a lot of space by deleting their older nightlies. Here are the aggregate disk usage numbers for nightly builds of Calendar (both Sunbird & Lightning), Camino, Mozilla Suite (not even built since 2007), and XulRunner, broken down by year:
Here are the nightly usage numbers for Thunderbird:
2010 159G (so far)
Again, there are big space recovery wins to be had here, depending on how far back we want our accessible, online repository of nightly builds to be.
In terms of policy changes to curb accumulation, I propose implementing three reforms, the first two of which come out of nthomas’ work in bug 562261.
First, we’ll script the automatic expiry of MAR files for nightly builds older than 1 month. Only the most recent complete and partial MARs are required, so this gives us a more-than-adequate buffer to detect and fix problems with the nightly update system. We have proven steps from bug 562261 that can be easily cron-ed to run weekly on weekends or other periods when the staging server is (relatively) idle. This can be done across all products.
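To make the idea concrete, here is a rough sketch of what such an expiry script might look like. It is not the set of steps from bug 562261; the function names, the 30-day approximation of "1 month", and the use of file modification time as the age of a MAR are all assumptions made for illustration.

```python
import time
from pathlib import Path

MAX_AGE_DAYS = 30  # assumption: "older than 1 month" approximated as 30 days


def expired_mars(root, now=None, max_age_days=MAX_AGE_DAYS):
    """List *.mar files under root whose mtime is older than the cutoff."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    return sorted(
        p for p in Path(root).rglob("*.mar")
        if p.is_file() and p.stat().st_mtime < cutoff
    )


def expire_mars(root, dry_run=True, now=None):
    """Delete expired MAR files; with dry_run=True only report them."""
    victims = expired_mars(root, now=now)
    for path in victims:
        if not dry_run:
            path.unlink()
    return victims
```

Defaulting to `dry_run=True` means a weekly cron job would have to opt in to deletion explicitly, which seems the safer default on a server where mistakes are hard to undo. Only files ending in `.mar` are considered, so the builds themselves are left alone.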
Second, the RelEng team will start purging the contents of old candidates directories as part of our release process. For those who are unaware, the candidates directory lives under the nightly directory and holds all the various release files (builds, source, signatures, logs) until we green-light the release. Once the release is official, the important contents are sync-ed over to the releases/ subdir and the candidates dir becomes mostly redundant, modulo a few important logging artifacts. We’ll delete the builds for all but the two most recent candidate dirs, but will preserve the text files/logs that tell us important things like # of builds/changesets/build IDs.
Note that this won’t be an automated procedure: the release engineer responsible for the current release will need to go in and look at the candidates directories involved and make a judgment call as to what to delete. Sometimes the release procedure goes awry or we try something new, and it’s important to be able to keep those examples around until we’ve learned what we can from them.
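Even with a human making the final call, a small helper could prepare the list of files to review. The sketch below is only an illustration: it assumes the release engineer passes in the candidate directories ordered oldest to newest, and that the artifacts worth keeping can be recognized by a `.txt` or `.log` suffix. Both of those, and the name `purge_plan`, are assumptions rather than a description of the real directory layout.

```python
from pathlib import Path

KEEP_SUFFIXES = {".txt", ".log"}  # assumption: text files/logs to preserve
KEEP_RECENT = 2                   # the two most recent candidate dirs stay intact


def purge_plan(candidate_dirs, keep_recent=KEEP_RECENT):
    """Return files proposed for deletion; nothing is deleted here.

    candidate_dirs must be ordered oldest to newest. The newest
    keep_recent directories are skipped entirely, and text files/logs
    in the older ones are left out of the plan.
    """
    cutoff = max(0, len(candidate_dirs) - keep_recent)
    plan = []
    for directory in candidate_dirs[:cutoff]:
        for path in sorted(Path(directory).rglob("*")):
            if path.is_file() and path.suffix.lower() not in KEEP_SUFFIXES:
                plan.append(path)
    return plan
```

The function only builds a list, so the release engineer can read it, strike anything from a release that went awry, and delete the rest by hand.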
Other projects that currently use a candidates directory for releases should also consider making this change.
Third, we should agree to revisit nightly storage on a yearly basis. Specifically, we should commit to taking the oldest year’s worth of nightly builds offline in January of each year, e.g. if we’re comfortable with a 3-year online repository of nightly builds, in January 2011 we would take the nightly builds for 2007 offline. As you can see from the YTD numbers for 2010, this is unlikely to actually keep up with increasing storage needs, but it is better than nothing.
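The arithmetic behind that example is simple enough to write down. This is just a sketch of the rule as described above, with a made-up function name:

```python
def year_to_take_offline(current_year, years_online):
    """Year whose nightlies go offline in January of current_year.

    Keeps `years_online` complete years online, plus whatever
    accumulates during the current year.
    """
    if years_online < 0:
        raise ValueError("years_online must be non-negative")
    return current_year - years_online - 1


# The example from the text: 3-year repository, January 2011.
print(year_to_take_offline(2011, 3))  # 2007
```

With a 3-year window in January 2011, the years 2008 through 2010 stay online and 2007 is the one to go.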
Here’s a brief summary of my proposals for reclaiming space on stage to make it easier for people to respond *AFTER* reading the above:
1) Move Firefox releases that are no longer supported (< 3.0, including firebird) to separate storage.
2) Remove Firefox nightlies prior to 2007, freeing 260G. These can be deleted if they're not going to be used, or archived if we think they might be.
3) Remove nightlies for products other than Firefox prior to 2007, freeing 174G. Again, "remove" can mean either deletion or archiving.
4) Automate the deletion of nightly MAR files older than one month. Only the most recent MAR files are required. This would be done across all products.
5) Delete builds from older candidates directories after official release. This will reclaim up to 13G per build attempt per release. This will be a manual process.
6) For every new year going forward, remove the oldest remaining year of nightlies, e.g. for a 3-year history of nightly builds, remove nightly builds from 2007 in January 2011. This will be a manual process.
Feedback is appreciated, either in the newsgroups or in bug 342972.