Almost every single Firefox user on Linux gets their builds directly from the various distributions. Ubuntu, Fedora, Red Hat, Debian (down-branded), Novell, Foresight, etc. And in those cases users generally have a pretty good experience. But that’s not always the case.
I’ve always seen this position as both good and bad. It’s good for the distribution end users since they don’t have to go and find a build of Firefox. And it’s good for Mozilla because it means that we don’t have to produce builds for N Linux distributions, which is basically an impossible task. It also means that distributions can make late-breaking fixes that are specific to their distribution that really affect their particular user base.
But it’s bad, too. It means we’re disconnected from users on those various Linux distributions. We’re at the mercy of the distributions to make updates in a timely manner, and very often we find ourselves chasing them to make updates that they clearly should be doing. For those users with whom we have a direct relationship we have a pretty good track record of making timely updates to almost our entire user base. Linux users are cut off – quite intentionally because that is the classic “value add” for a Linux distributor – from our update train, sometimes leaving them vulnerable for weeks or months. (Note that this does not generally affect the top-tier Linux distributors like Canonical, Red Hat and Novell. They are actually excellent at delivering updates because they have dedicated engineers whose only job is to chase Mozilla and be ready when we’re ready to deliver an update – either a fire drill or a planned update release.)
There’s another large downside as well. Distributions often make changes to how Firefox is built – be it compiler optimization changes, linking with system libraries instead of the ones that we ship with, adding their own large patches to add support for some random feature or making changes to the default look and feel of the browser.
Contrary to uninformed hyperbole, Mozilla actually does a huge amount of testing on Linux. It’s one of our three top-tier platforms and we run it through all the same regression and performance testing that the other platforms get. You can see this by looking at how much attention we pay to tuning the compiler to give us the best performance (hint: more -O doesn’t always make us faster, nor does architecture-specific tuning!) or how much time we’ve spent on the reported fsync issues that have affected quite a few people.
Because of the amount of work that we do on Linux and how closely we work with upstream projects (sqlite, cairo, etc.) we’re still the experts on what works and what doesn’t. And because we have a pretty full set of tests that we go through we know what versions of upstream projects work well inside of Mozilla. Note that this doesn’t mean we know which versions don’t work with Mozilla, as I will illustrate later. We can’t be compatible with every single version of every upstream project in every single possible configuration; it just doesn’t work.
I’ll use a specific case in point here to illustrate what I’m talking about with Fedora + Red Hat. (Note that I’m pointing this out because it’s a real situation, not because I think that the Fedora + Red Hat guys are doing a bad job – they actually do a fantastic job given the task as far as I’m concerned. The issue I’m about to describe does not affect Fedora 8 or Fedora 9 users – only those who happen to be using Fedora Rawhide – the bleeding edge of the bleeding edge.)
Chris Aillon and Matthias Clasen were reporting an issue to me where Firefox was hanging for long periods of time in Fedora Rawhide while opening the history tab. I figured that it was the same old fsync-related problem but they were reporting that it was happening for long periods of time (30 seconds in a lot of cases) and it was happening on systems that were relatively unloaded in terms of IO. I was near the Red Hat office on a personal errand and I thought I would stop by and try to help diagnose the issue. Looking at the issue in a debugger I found that it was hanging down in sqlite and not returning into Mozilla-specific code at all. I also noticed that they were linked against the system sqlite instead of the one that we ship with. I asked Matthias to try a Mozilla-built Firefox on his machine with his profile and it did not have any problems. When Chris generated a build for Fedora Rawhide that used our internal sqlite version it also didn’t have any problems.
It turns out that the sqlite version that’s included in Rawhide, version 3.5.8, has a bug in a particular type of query that Mozilla uses extensively. When Mozilla updated to that version of sqlite our automated testing picked up on the problem and the change was backed out of our tree. Let’s look at the order of operations that caused this particular issue.
- Mozilla checks in a patch to upgrade to sqlite version 3.5.8.
- Fedora Rawhide notices the new requirement in configure and bumps their system sqlite version to 3.5.8.
- Mozilla’s automated testing picks up failures as a result of the new sqlite version and backs out the changes.
- That backout is missed by the Fedora folks and they are left linking against an sqlite version that contains the problem.
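One way a packager might guard against that last step is to keep a small list of library versions that upstream has tried and backed out, and to fall back to the bundled copy when the system version is on it. The following is only a rough sketch of that idea, not anything Mozilla or Fedora actually ships: the names `KNOWN_BAD_SQLITE`, `parse_version`, `is_safe_system_sqlite` and `choose_sqlite` are made up for illustration, and the only real data point in it is the 3.5.8 version from the story above.

```python
import sqlite3

# Hypothetical list of sqlite releases that upstream tried and backed out.
# Only 3.5.8 comes from the story above; the list itself is an assumption.
KNOWN_BAD_SQLITE = {(3, 5, 8)}


def parse_version(text):
    """Turn a dotted version string such as '3.5.8' into the tuple (3, 5, 8)."""
    return tuple(int(part) for part in text.split("."))


def is_safe_system_sqlite(version_text, known_bad=KNOWN_BAD_SQLITE):
    """Return True when the given version is not on the backed-out list."""
    return parse_version(version_text) not in known_bad


def choose_sqlite(system_version_text):
    """Pick 'system' or 'bundled', mirroring the choice a packager makes."""
    if is_safe_system_sqlite(system_version_text):
        return "system"
    return "bundled"


if __name__ == "__main__":
    # sqlite3.sqlite_version is the version of the sqlite library that
    # this Python interpreter is linked against at runtime.
    version = sqlite3.sqlite_version
    print(version, "->", choose_sqlite(version))
```

A check like this only helps if the list is kept in sync with upstream backouts, which is really a communication problem more than a code problem.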
This isn’t the worst example of what goes on when distributors are making changes to upstream software. The impact here was pretty minor – only a few users were affected and the bug was pretty obvious to a large number of people. It does get worse. Ask Debian and Ubuntu users how happy they are about regenerating keys in light of the OpenSSL issues that were recently found with the downstream patch. (I realize that’s an oversimplification of that particular issue but it has a lot of the same qualities as this example.)
This is a real problem, one that we’ve even successfully predicted.
So how does this relate to the fsync issue? Well, it shows the opposite end of the patch spectrum. Basically every single Linux distribution is waiting for a good fix to that particular problem. And they will all ship a fix to their users. So sometimes distro-specific patching is a good thing.
The trick has to be finding the balance. Right now we know that there are a lot of instances where bad or ignorant decisions are being made. (Just because an option exists in ./configure doesn’t mean that you should use it!) People clearly aren’t taking advantage of Mozilla’s automated testing facilities – in the Fedora example the problem would have shown up pretty quickly if they were running the same tests Mozilla does. And the flow of information between Mozilla and the various distributions is ad hoc at best. Very often more effort is spent debugging the blame instead of debugging the actual problem at hand. I’ve been on the receiving end of that recently and it’s certainly soul-crushing.
There’s no easy answer to the multiple-library-version problem, either. Once again, we’re not going to be compatible with everything everywhere, especially on Linux where the platform is more like quicksand than green grass. Just screaming “you should always link against system libraries!!!!” isn’t going to work when the size and complexity of Linux continues to expand without any contraction of the complexity involved in which Linux you should target. That, combined with blind version updating, means that we’re just going to be stuck with multiple versions of libraries – assuming you want a quality product that works as well as it does on Windows and the Mac and does so consistently.
So if I had to wrap up this post with some lessons learned I would put them down as this:
- Don’t change a default configure option unless there’s a very good and very specific reason for it. No, really, don’t touch the optimization flags – we’ve probably tuned them to an inch of the compiler’s life. But if you actually run tests and find something faster, let us know.
- Don’t ship a patch unless it’s been vetted by upstream.
- Ship known good patches if it will help your users. But don’t do it without talking to us first. Remember, it’s possible – and likely – that we’ve known about your issue for quite a while. And while you’re at it, if you had to fix something, consider adding a test to our test suite so it won’t happen again.
- If you do want to carry some heavy patches, consider following the pattern that the enterprise distributions are using: they work together (at least Ubuntu, Novell and Red Hat) to carry patches for some pretty old Firefox releases, and they have been relatively successful doing it.
- And finally, remember that there’s a fine line between adding real value and making a change for change’s sake. The former is always encouraged (within limits!) but the latter often causes more trouble than it is worth.
In closing I think I would just like to say that there’s a lot of work to be done here. And it’s something that will need constant adjustment – there’s no one set of rules that we can develop and expect never to change. Both Linux distributors and Mozilla can do better than they have done to date in terms of making things better for users on Linux. That means thinking about changes, improving communication, understanding the reasons for changes or divergence from upstream, etc.
And all of that discussion has to come from the standpoint of making sure that the user’s experience is improved. If it doesn’t improve their experience in a very tangible way, it’s probably not worth doing.