Almost every single Firefox user on Linux gets their builds directly from the various distributions. Ubuntu, Fedora, Red Hat, Debian (down-branded), Novell, Foresight, etc. And in those cases users generally have a pretty good experience. But that’s not always the case.
I’ve always seen this position as both good and bad. It’s good for the distribution end users since they don’t have to go and find a build of Firefox. And it’s good for Mozilla because it means that we don’t have to produce builds for N Linux distributions, which is basically an impossible task. It also means that distributions can make late-breaking fixes that are specific to their distribution that really affect their particular user base.
But it’s bad, too. It means we’re disconnected from users on those various Linux distributions. We’re at the mercy of the distributions to make updates in a timely manner, and very often we find ourselves chasing them to make updates that they clearly should be doing. For those users where we have a direct relationship we have a pretty good track record of making timely updates to almost our entire user base. Linux users are cut off – quite intentionally because that is the classic “value add” for a Linux distributor – from our update train, sometimes leaving them vulnerable for weeks or months. (Note that this does not generally affect the top-tier Linux distributors like Canonical, Red Hat and Novell. They are actually excellent at delivering updates because they have dedicated engineers who only have the job to chase Mozilla and be ready when we’re ready to deliver an update – either a firedrill or planned update release.)
There’s another large downside as well. Distributions often make changes to how Firefox is built – be it compiler optimization changes, linking with system libraries instead of the ones that we ship with, adding their own large patches to add support for some random feature or making changes to the default look and feel of the browser.
Contrary to uninformed hyperbole, Mozilla actually does a huge amount of testing on Linux. It’s one of our three top tier platforms and we run it through all the same regression and performance testing that the other platforms get. You can see this by looking at how much attention we pay to how to tune the compiler to give us the best performance (hint: more -O doesn’t always make us faster, nor does architecture-specific tuning!) or how much time we’ve spent on the reported fsync issues that have affected quite a few people.
Because of the amount of work that we do on Linux and how closely we work with upstream projects (sqlite, cairo, etc) we’re still the experts on what works and what doesn’t. And because we have a pretty full set of tests that we go through we know what versions of upstream projects work well inside of Mozilla. Note that this doesn’t mean we know which versions don’t work with Mozilla, as I will illustrate later. We can’t be compatible with every single version of every upstream project with every single possible configuration, it just doesn’t work.
I’ll use a specific case in point here to illustrate what I’m talking about with Fedora + Red Hat. (Note that I’m pointing this out because it’s a real situation, not that I think that the Fedora + Red Hat guys are doing a bad job – they actually do a fantastic job given the task as far as I’m concerned. The issue I’m about to describe does not affect Fedora 8 or Fedora 9 users – only those who happen to be using Fedora Rawhide – the bleeding edge of the bleeding edge.)
Chris Aillon and Matthias Clasen were reporting an issue to me where Firefox was hanging for long periods of time in Fedora Rawhide while opening the history tab. I figured that it was the same old fsync-related problem but they were reporting that it was happening for long periods of time (30 seconds in a lot of cases) and it was happening on systems that were relatively unloaded in terms of IO. I was near the Red Hat office on a personal errand and I thought I would stop by and try to help diagnose the issue. Looking at the issue in a debugger I found that it was hanging down in sqlite and not returning into Mozilla-specific code at all. I also noticed that they were linked against the system sqlite instead of the one that we ship with. I asked Matthias to try a Mozilla-built Firefox on his machine with his profile and it did not have any problems. When Chris generated a build for Fedora Rawhide that used our internal sqlite version it also didn’t have any problems.
It turns out that the sqlite version that’s included in Rawhide, version 3.5.8, has a bug in a particular type of query that Mozilla uses extensively. When Mozilla updated to that version of sqlite our automated testing picked up on the problem and the change was backed out of our tree. Let’s look at the order of operations that caused this particular issue.
- Mozilla checks in a patch to upgrade to sqlite version 3.5.8.
- Fedora Rawhide notices the new requirement in configure and bumps their system sqlite version to 3.5.8.
- Mozilla’s automated testing picks up failures as a result of the new sqlite version and backs out the changes.
- That backout is missed by the Fedora folks and they are left linking against an sqlite version that contains the problem.
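The sequence above comes down to a system library that has moved ahead of the version upstream currently builds and tests against. Here is a minimal sketch, in Python, of a check a packager could run to catch that drift. The function names and the 3.5.7 figure are made up for illustration (the post does not say which version Mozilla went back to), and real version strings can be messier than plain dotted integers:

```python
def parse_version(text):
    """Turn a dotted version string such as '3.5.8' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def system_is_ahead(system_version, upstream_version):
    """Return True when the system library is newer than the version
    upstream currently bundles and tests against."""
    return parse_version(system_version) > parse_version(upstream_version)


if __name__ == "__main__":
    # Hypothetical numbers: upstream bumped to 3.5.8, the distribution
    # followed, then upstream backed out to an earlier version.
    upstream_bundled = "3.5.7"   # assumed for illustration only
    system_installed = "3.5.8"   # the version named in the post
    if system_is_ahead(system_installed, upstream_bundled):
        print("system sqlite is ahead of upstream's tested version; "
              "check for a backout")
```

A check like this only says that the two sides have diverged. Whether the newer system version is actually a problem still has to come from running the tests.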
This isn’t the worst example of what goes on when distributors are making changes to upstream software. The impact here was pretty minor – only a few users were affected and the bug was pretty obvious to a large number of people. It does get worse. Ask Debian and Ubuntu users how happy they are about regenerating keys in light of the OpenSSL issues that were recently found with the downstream patch. (I realize that’s an oversimplification of that particular issue but it has a lot of the same qualities as this example.)
This is a real problem, one that we’ve even successfully predicted.
So how does this relate to the fsync issue? Well, it shows the opposite end of the patch spectrum. Basically every single Linux distribution is waiting for a good fix to that particular problem. And they will all ship a fix to their users. So sometimes distro-specific patching is a good thing.
The trick has to be finding the balance. Right now we know that there are a lot of instances where bad or ignorant decisions are being made. (Just because an option exists in ./configure doesn’t mean that you should use it!) People clearly aren’t taking advantage of Mozilla’s automated testing facilities – in the Fedora example the problem would have shown up pretty quickly if they were running the same tests Mozilla does. And the flow of information between Mozilla and the various distributions is ad hoc at best. Very often more effort is spent debugging the blame instead of debugging the actual problem at hand. I’ve been on the receiving end of that recently and it’s certainly soul-crushing.
There’s no easy answer to the multiple-library-version problem, either. Once again, we’re not going to be compatible with everything everywhere, especially on Linux where the platform is more like quicksand than green grass. Just screaming “you should always link against system libraries!!!!” isn’t going to work when the size and complexity of Linux continue to expand without any contraction of the complexity involved in which Linux you should target. That, combined with blind version updating, means that we’re just going to be stuck with multiple versions of libraries – assuming you want a quality product that works as well as it does on Windows and the Mac and does so consistently.
So if I had to wrap up this post with some lessons learned I would put them down as this:
- Don’t change a default configure option unless there’s a very good and very specific reason for it. No, really, don’t touch the optimization flags – we’ve probably tuned them to an inch of the compiler’s life. But if you actually run tests and find something faster, let us know.
- Don’t ship a patch unless it’s been vetted by upstream.
- Ship known good patches if it will help your users. But don’t do it without talking to us first. Remember, it’s possible – and likely – that we’ve known about your issue for quite a while. And while you’re at it if you had to fix something consider adding a test to our test suite so it won’t happen again?
- If you do want to carry some heavy patches consider following the pattern that the enterprise distributions are using: they work together (at least Ubuntu, Novell and Red Hat) to carry patches for some pretty old Firefox releases. And have been relatively successful doing it.
- And finally, remember that there’s a fine line between adding real value and making a change for change’s sake. The former is always encouraged (within limits!) but the latter often causes more trouble than it is worth.
In closing I think I would just like to say that there’s a lot of work to be done here. And it’s something that will need constant adjustment – there’s no one set of rules that we can develop and expect them never to change. Both Linux distributors and Mozilla can do better than they have done to date in terms of making things better for users on Linux. Thinking about changes, improving communication, understanding the reason for changes or diversion from upstream, etc.
And all of that discussion has to come from the standpoint of making sure that the user’s experience is improved. If it doesn’t improve their experience in a very tangible way, it’s probably not worth doing.
-
I enjoyed reading this post; good points without getting too ranty. I was wondering how easy it is for a distribution to run the mozilla test suite on their own builds?
It seems like getting distros to run the “mozilla approved test suite” on their own builds would help catch problems with the customizations. If the process of running the test is complex it would explain why it doesn’t get run. I have a feeling most people maintaining downstream would also feel better running the official tests.
-
The problem with a distro using supplied versions of libraries is two-fold:
1) Drive space usage and download times.
2) Security updates then need to be tracked across multiple embedded versions of software.
This point is subtler than one thinks at first. A particularly bad culprit for this is zlib. zlib is embedded in so many different packages, that at one point when Ubuntu had to do a security fix, it was still finding packages to update a couple weeks later. That’s why the decision was made to aggressively remove multiple instances of software.
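To make the tracking problem concrete, here is a rough sketch in Python of hunting for embedded copies of a library in a tree of unpacked sources. Matching on a header file name such as zlib.h is an assumption made for illustration. It will miss renamed or partially copied code, which is part of why a real search can drag on.

```python
import os


def find_bundled_copies(root, marker_names=("zlib.h",)):
    """Walk a source tree and return the paths of files whose names
    suggest an embedded copy of a library (hypothetical heuristic)."""
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name in marker_names:
                hits.append(os.path.join(dirpath, name))
    return sorted(hits)


if __name__ == "__main__":
    # Scan the current directory; each hit is a candidate that would need
    # its own security update if the library had a vulnerability.
    for path in find_bundled_copies("."):
        print(path)
```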
While in this particular case, Fedora got bit by doing the same thing, in a way it’s good to have it happen. Assuming that you guys had filed the bug upstream with sqlite, they’d go through some debugging and find it there (and probably also find the same bug in Mozilla’s bugzilla). They might take the time to fix the bug properly, or revert the patch from the system copy of sqlite. In the latter case, all Fedora users win, because they don’t have a painfully slow query; in the former case, all users win because the bug gets fixed properly. Most projects would even quickly put a new version out the door if they found that they were clobbering a large well-known package like FF (I don’t know the sqlite folks at all).
If a distro chooses to use the bundled versions, none of the benefits come to play, and all of the problems set in. While it maximizes end-user experience for direct downloads, it would be a mistake for distros to follow that path.
-
> Just screaming “you should always link against system libraries!!!!” isn’t going to work when the size and complexity of Linux continues to expand without any contraction of the complexity involved in which Linux you should target.
Just pointing to the size doesn’t mean the complexity won’t work. But the answer has to be target something, make sure it works, and leave it to the various distro people to figure out if it is their distro that doesn’t work.
If you don’t link to the system libraries you make the scalability problem worse, and you end up like Microsoft who were releasing exactly the same code correction in different libraries versions and different software for over 3 years.
I suspect one of the issues here is Ubuntu shipping software that just “isn’t ready”, be it sqlite, or FF betas. There is a reason Debian sells a t-shirt that says “good things come to those who wait”, and ships when the release critical bugs are fixed, even if they sometimes miss some, or introduce others….
-
I think one issue here is that sometimes Mozilla is thinking about what will make Firefox look best, but not what will make the whole system be best.
Take the “use system copies of libraries or not” issue; sure, using system copies can introduce the occasional bug not in the internal copies. If two pieces of software are different, they will have different bugs. True fact.
But if every app on the system has internal copies of libraries, here are some of the negatives:
1) a security update to the lib will be an enormous download and an enormous amount of work as the user pulls down basically every app on the system again (in particular for popular libs like cairo or sqlite)
Let’s imagine the Debian openssl bug, but there were 20 different copies of openssl on a Debian system, with different patches in each. A better situation? Doubtful. Imagine you’re the openssl upstream trying to figure out the Debian local patches in that case…
Most security bugs are *not* in distribution patches. The Debian thing was an anomaly both in severity of bug and in being distribution-specific.
2) the system libs often contain important bugfixes too. For example, cairo has deep interactions with X. The X.org developers could even legitimately claim that it’s vital for everyone to use the Cairo that has been tested against a particular X server version. Now say Mozilla has tested with Cairo version A and it works best, but the X.org in a distribution has been tested with version B and it works best. Maybe the X.org upstream is even demanding everyone use Cairo v. B, or version A crashes the X server or is super-slow.
3) disk space and memory usage from tons of copies of libs would be significant for most of the major desktop libs.
Those 3 things are very clearly worse overall for users than the occasional bug that gets caught in Rawhide (or even the occasional slow-as-heck pango patch).
The bottom line is that integrating a bunch of software is hard. Mozilla is integrating a bunch of software. Distributions are taking that bunch of software and integrating it with even more software.
It’s basically nonsense that if distributions just downloaded all the pristine upstreams and compiled them, the result would be good. In fact the result is empirically known to be terrible (there are some distributions that roughly do this, and they suck).
So, distributions are adding value. That’s why everyone uses distributions. We can nitpick around the edges around exactly how to do the whole-OS integration, but the reality is that whole-OS integration is necessary, and that it involves patching upstream software.
The only way to fix this would be to have only one Linux distribution that everyone contributed to, much as there’s only one Windows or Mac OS. Then “upstream” would be the same as the distribution.
In the meantime I think the realistic discussion is about how to improve communication, get patches upstream and reviewed quickly, and so forth. And in the best case, even have upstream *co-maintain* the distribution packages, which is completely possible with many distributions, or close to it.
The debate about “is it OK to patch things in distributions” is counterproductive in my mind. It is *necessary*. You don’t just download tens of millions of lines of unrelated code and expect it to work together unmodified, let alone expect it to be a good user experience.
So the debate shouldn’t be about whether OS integration is needed. It should be about the particulars of each patch, and the communication around each patch with upstream. Many of the distributions are pretty much open projects themselves – for the major ones, there’s no reason Mozilla couldn’t be intimately involved in exactly how they release Firefox.
I think this stuff usually just comes down to honest technical agreement on which changes make sense, or else failure to talk at all because everyone is busy.
-
@Pete: You should check out mozilla/tools/buildbot-configs/testing/unittest for a look at our testing buildbot configuration (from cvs-mirror.mozilla.org, natch). The master.cfg and mozbuild.py files therein should show you what you need to run. Feel free to send me an email to robcee at mozilla dot com if you have any questions about running the individual pieces or would like to setup your own buildbot. It’s shockingly easy!
(and maybe fodder for a blog post of my own)
-
Either I don’t understand what you’re saying or I do and I couldn’t disagree more.
I fail to see how a user would be happier if Firefox in Fedora was embedding SQLite.
It’s better that the bug exists in the *shared* library, so that it can be fixed or worked around for _all_ applications at once. The more users use a particular library the better, because bugs can be discovered that affect everyone and they can be fixed for everyone.
The other important issue at hand, the security implications, was already mentioned and analyzed.
I don’t need to tell you that free software works as an ecosystem and that we should all work towards a better system as a whole and not for a better little corner of ours.
Also, something that personally annoys me, but a bit off-topic:
You mention that all distributions modify Mozilla’s patches and apply their own customizations; however, Debian is the only one that was forced to rename the package for exactly that reason. Why is that?
-
Faidon, this is really off topic for this blog post, but… Debian had to rename Firefox because of the incompatibility between its licensing policy (all works must be freely modifiable) and Mozilla’s (the Firefox artwork may not be modified, and the Firefox name may not be used without the Firefox artwork). Let’s not waste any more space on this page discussing this red herring. :)
-
From your narrative and the comments in 429336, it’s not clear whether or when Mozilla worked with the sqlite developers to fix this regression, in addition to rolling back the update. Wouldn’t it be better for everyone if the regression were fixed upstream so it could benefit both Firefox and other applications using the system-wide sqlite?
-
Thanks for the interesting post. I had massive problems with firefox on ubuntu after upgrading to hardy. I had been running nightlies before the upgrade and couldn’t figure out why beta 3b5 would perform so much worse on linux than any of the nightly builds I had seen. I guess it was because after the upgrade I switched to the distribution’s package which was probably linked to system libraries as you explained. The complex interplay between so many different parties (upstream and down) really amazes me.
-
+1 to what Havoc and others are saying, distros will always link to system libraries. It doesn’t make sense to ship dozens of copies of SQLite just because each application was tested with a different one.
As for “That backout is missed by the Fedora folks”, I don’t think that’s really the explanation (the maintainers actually involved with the affected packages can tell you more; I’m maintaining some packages in Fedora, but none related to this issue). See, in distributions, we can’t just downgrade software at a whim, for several reasons:
1. Package managers decide what package is newer by its version. If we want to revert a package, we have to either break the upgrade path (this might be acceptable at this stage of Rawhide, but in other situations it isn’t, and some distributions like Debian have automated enforcement which forbids doing this ever, at any development stage) or use an ugly hack called the “Epoch”, which causes other problems (for example, if package foo has Epoch 1 and if you require foo >= 3.5 and forget to specify the Epoch, you’ll end up with foo 1:3.4 (3.4, also with Epoch 1) matching, which is clearly not what you want; also packages might have required the new version before the reversion, and the reverted version will match for the same reason as before, which is wrong).
2. Many libraries are only backwards-compatible, not forwards-compatible. So if we build packages against foo 3.1.4 and then revert to 3.1.3, everything built against 3.1.4 might be broken. And given that one-way-compatible libraries don’t involve a soname bump and that only few libraries use symbol versioning (for portability reasons), the package manager has no way to know that it’s broken, potentially causing many subtle bugs at runtime which are very hard to track down. (The most obvious one is an application failing to start with an unresolved symbol, but there can be more subtle breakage.)
3. We can’t always downgrade a library just to make one application happy, there are others which may require the new version or at least benefit from the bugfixes.
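Points 1 and 2 above can be sketched in a few lines of Python. This is a simplified model, not how any real package manager compares versions: it assumes plain dotted integers, treats a missing Epoch as 0 (which matches the behaviour described in point 1), and uses made-up symbol names for the hypothetical library foo.

```python
def parse_evr(text):
    """Split 'epoch:version' into (epoch, version tuple).

    A missing epoch is treated as 0 (assumption, matching the behaviour
    described in point 1).
    """
    epoch, sep, version = text.partition(":")
    if not sep:
        epoch, version = "0", text
    return int(epoch), tuple(int(part) for part in version.split("."))


def satisfies_at_least(installed, required):
    """Check a 'package >= required' dependency, comparing epoch first."""
    return parse_evr(installed) >= parse_evr(required)


# Hypothetical symbol tables for two versions of a library 'foo'.
EXPORTS = {
    "3.1.3": {"foo_open", "foo_close"},
    "3.1.4": {"foo_open", "foo_close", "foo_open_v2"},
}


def unresolved_symbols(needed, library_version, exports=EXPORTS):
    """List the symbols an application needs that the library lacks."""
    return sorted(set(needed) - exports[library_version])


if __name__ == "__main__":
    # Point 1: the requirement forgot the epoch, so 1:3.4 satisfies '>= 3.5'.
    print(satisfies_at_least("1:3.4", "3.5"))      # True
    print(satisfies_at_least("1:3.4", "1:3.5"))    # False
    # Point 2: an app built against 3.1.4 breaks after a revert to 3.1.3.
    print(unresolved_symbols({"foo_open", "foo_open_v2"}, "3.1.3"))
```

In the second case nothing in the package metadata changes between 3.1.3 and 3.1.4 in this model, so the missing symbol only shows up when the application runs.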
So we don’t really have the luxury of simply “backing out” an upgrade like you do with your bundled copies. But switching to bundled copies for everything is not the solution (for the reasons Havoc and others have already given). The correct solution is working with sqlite upstream to get the performance back to acceptable levels as quickly as possible, not to expect distributions to ship the random old version you happened to test with.
-
> Just screaming “you should always link against system libraries!!!!” isn’t going to work.
Yes, the system library (like sqlite) might be buggy sometimes. But the correct fix is to talk with the upstream library people and get it fixed, not sulk and keep the non-buggy version bundled with the application. In the case of keeping a non-buggy version bundled (sqlite), you (the Mozilla team) are acting exactly like the distributions you are criticizing.
If distros had used Mozilla with bundled sqlite, all *other* applications would still be using the buggy sqlite.
Btw, if you were aware of this bug in this specific version of sqlite, why didn’t Mozilla’s ./configure barf on detecting the buggy sqlite version?
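One possible shape for such a check, sketched in Python rather than in real configure script syntax. The function name and the vendor_confirms_fix override are hypothetical. The check reports a warning instead of stopping the build, since a version number alone cannot show whether a packager has already patched the library:

```python
def check_library_version(version, known_bad, vendor_confirms_fix=False):
    """Return (status, message); status is 'ok' or 'warn', never a hard error."""
    if version not in known_bad:
        return "ok", "version %s has no recorded problem" % version
    if vendor_confirms_fix:
        # The packager asserts that their build of this version carries a fix.
        return "ok", "version %s is flagged, but the packager reports a fix" % version
    return "warn", "version %s is flagged as problematic; build continues" % version


if __name__ == "__main__":
    status, message = check_library_version("3.5.8", known_bad={"3.5.8"})
    print(status, "-", message)
```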
-
Careful with that last suggestion: making this a hard error is wrong. It will prevent distributions from building against a patched version with this bug fixed. Distros don’t want to have to wait for the next release of the library if they can get a patch sooner, and they won’t necessarily want to upgrade rather than backport the fix either (due to things like feature freezes etc.). Fedora Core 4 got hit by such a hardcoded check against GCC 4.0.0 in KDE’s acinclude.m4, every single KDE app had to be patched with some variant of this patch:
http://cvs.fedoraproject.org/viewcvs/rpms/amarok/FC-4/amarok-1.2.4-gcc4bl.patch?hideattic=0&rev=1.1&view=markup
or this patch:
http://cvs.fedoraproject.org/viewcvs/rpms/koffice/FC-4/koffice-admin-gcc4isok.patch?hideattic=0&rev=1.1&view=markup
until GCC was updated to a later 4.0.x. But the bug KDE blacklisted GCC 4.0.0 for was fixed before the FC4 release! It just wasn’t 4.0.1 yet, but something in between. So the blacklisting was incorrect and annoying.
-
> FF3 will require GTK 2.10
You call that “bleeding edge” and/or “far ahead of the distros”? Fedora has been shipping GTK+ 2.10 since Fedora Core 6 which isn’t even supported anymore, and 2.12 since Fedora 8 which isn’t even the latest version anymore (Fedora 9 is). Rawhide already has GTK+ 2.13 (the development branch heading towards 2.14), so presumably Fedora 10 will have 2.14.
> We’re very involved with them and know that that particular issue
> that Fedora ran across was fixed in a later version. We’re just too
> late in our cycle to take that new package since it would invalidate
> the beta cycle that over 1 million have taken part in.
But Fedora Rawhide isn’t at this stage of the cycle; it is at the pre-alpha stage, so I suppose (again, it’s not me maintaining sqlite) that this will get fixed in Rawhide by upgrading sqlite, not downgrading it.
-
Pingback from shaver » fsyncers and curveballs on May 25, 2008 at 9:04 am
-
This is a great article on managing the risk of any software production that includes 3rd party components. It is sage advice for anybody building applications out of 3rd party FOSS stacks.
However, I’m biased. SQLite is usually not the culprit from other experiences with it. Usually when something goes wrong, it is something else. My own observations tell me that when SQLite became the storage engine for Firefox and Chrome, browser stability improved. I claim that this is a correlation only, but my gut tells me that SQLite contributed to the robustness.

