bholley

Boiling the Ocean, Incrementally - How Stylo Brought Rust and Servo to Firefox

Tue, 28 Nov 2017 00:00:00 +0000

Two weeks ago, we released Firefox Quantum to the world. It’s been a big moment for Mozilla, shaping up to be a blockbuster release that’s changing how people think and talk about Firefox. It’s also a cathartic moment for me personally: I’ve spent the last two years pouring my heart and soul into Quantum’s headline act, known as Stylo, and it means a lot to see it so well-received.

But while all the positive buzz is gratifying, it’s easy to miss the deeper significance of what we just shipped. Stylo was the culmination of a near-decade of R&D, a multiple-moonshot effort to build a better browser by building a better language. This is the story of how it happened.

Safety at Scale

Systems programmers have been struggling with memory safety for a long time. It is virtually impossible to develop and maintain a large-scale C/C++ application without introducing bugs that, under the right conditions and input, cause control flow to go off the rails and compromise security. There are those who claim otherwise, but I’m quite skeptical.

Browsers are the canonical example here. They’re enormous - millions of lines of C++ code, thousands of contributors, decades of cruft - and there’s enough at stake to create large incentives to find and avoid security-sensitive bugs. Mozilla, Google, Apple, and Microsoft have been at this for decades with access to some of the best talent in the world, and vulnerabilities haven’t stopped. So it’s pretty clear by now that “don’t make mistakes” is not a viable strategy.

Adding concurrency into the mix makes things exponentially worse, which is a shame because concurrency is the only way a program can utilize more than a fraction of the resources in a modern CPU. But with engineers struggling to keep the core pipeline correct under single-threaded execution, multi-threaded algorithms haven’t been a luxury any browser vendor could afford. There are too many details to get right, and getting any of them even slightly wrong can be catastrophic.

Getting details right at scale generally requires the right tools. For example, register allocation is a tedious process that bedeviled assembly programmers, whereas higher-level languages like C++ handle it automatically and get it right every single time. But while C++ effortlessly handles many low-level details, it just wasn’t built to guarantee memory and thread safety.

Could the right tool be built? In the late 2000s, some people at Mozilla decided to try, and announced Rust and Servo. The plan was simple: build a replacement for C++, and use the result to build a replacement for Gecko. In other words, Boil the Ocean - twice.

Rust

I am a firm proponent of incrementalism. I think the desire to throw everything away and start from scratch tends to be an emotional one, and generally indicates a lack of focus and clear thinking about what will actually move the needle.

This may sound antithetical to big, bold changes, but it’s not. Almost everything successful is incremental in one way or another. The teams behind revolutionary products succeed because they make strategic bets about which things to reinvent, and don’t waste energy rehashing stuff that doesn’t matter.

The creators of Rust understood this, and the language owes its remarkable success to careful and pragmatic decisions about scope and focus:

They borrowed Apple’s C++ compiler backend, which lets Rust match C++ in speed without reimplementing decades of platform-specific code generation optimizations.
They leaned on the existing corpus of research languages, which contained droves of well-vetted ideas that nonetheless hadn’t been or couldn’t be integrated into C++.
They included the unsafe keyword - an escape hatch which, for an explicit section of code, allows programmers to override the safety checks and do anything they might do in C++. This allowed people to start building real things in Rust without waiting for the language to grow idiomatic support for each and every use case.
They built a convenient package ecosystem, allowing the out-of-the-box capabilities of Rust to grow while the core language and standard library remained small.

These tactics were by no means the only ingredients to Rust’s success. But they made success possible by neutralizing the structural advantages of C++ and allowing Rust’s good ideas - particularly its control over mutable aliasing - to reach production code.

Servo

Rust is a big leap forward for the industry, and should make its creators proud. But the grand plan for Firefox required a second moonshot, Servo, with an even steeper path to success.

At first glance, the two phases seem analogous: build Rust to replace C++, and then build Servo to replace Gecko. However, there’s a crucial difference - nobody expects the Rust compiler to handle C++ code, but browsers must maintain backwards-compatibility with every single webpage ever written. What’s more, the breadth of the web platform is staggering. It grew organically over almost three decades, has no clear limits in scope, and has lots of tricky observables that thwart attempts to simplify. Reimplementing every last feature and quirk from scratch would probably require thousands of engineer-years. And Mozilla, already heavily outgunned by its for-profit rivals, could only afford to commit a handful of heads to the Servo project.

That kind of headcount math led some people within Mozilla to dismiss Servo as a boondoggle, and the team needed to move fast to demonstrate that Rust could truly build the engine of the future. Rather than grinding through features indiscriminately, they stood up a skeleton, stubbed out the long tail, and focused on reimagining the core pipeline to eliminate performance bottlenecks. Meanwhile, they also invested heavily in community outreach and building a smooth workflow for volunteers. If they could build a compelling next-generation core, they wagered that a safe language and more-accurate specifications from WHATWG could allow an army of volunteers to fill in the rest.

By 2015, the Servo team had built some seriously impressive stuff. They had CSS and layout engines with full type-safe concurrency, which allowed them to run circles around production browsers on multi-core machines. They also had an early prototype of a full-GPU graphics layer called WebRender which dramatically lowered the cost of rendering. With Firefox falling behind in the market, Servo seemed like just the sort of secret sauce that could get us back in the game. But while Servo continued to build volunteer momentum, the completion curve still stretched too far into the future to make it an actionable replacement for Gecko.

Stylo

Whenever a problem seems impossibly hard, tackling it incrementally is a reliable way to gain traction. So near the end of 2015, some of us started brainstorming ways to use parts of Servo in Firefox. Several proposals floated around, but the two that seemed most workable were the CSS engine and WebRender. This post is about the former, but WebRender integration is also making exciting progress, and you can expect to hear more about it soon.

Servo’s CSS engine was an attractive integration target because it was extremely fast and relatively mature. It also serves as the bridge between the DOM and layout, providing a beachhead for further expansion of Rust into rendering code. Unfortunately, CSS engines are also tightly coupled with DOM and layout code, so there is no clean API surface at which to cut. Swapping it out is a daunting task, to say nothing of the complexities of mixing in a new programming language. So there was a lot of skepticism and some chuckling when we started telling people what we were up to.

But we dove in anyway. It was a small team - just me and Cameron McCormack for the first few months, after which point Emilio Cobos joined us as a volunteer. We picked our battles carefully, seeking to maintain momentum and prove viability without drowning in too many tricky details. In April 2016, we got our first pixels on the screen. In May, we rendered Wikipedia. In June, we rendered Wikipedia fast. The numbers were encouraging enough to convince management to launch it as part of Project Quantum, and scale up resourcing to get it done.

Over the next fifteen months, we transformed that prototype into the most advanced CSS engine ever built, one which harnesses the guarantees of Rust to achieve a degree of parallelism that would be intractable to replicate in C++. The technical details are too involved to get into here, but you can learn more about them in Lin Clark’s excellent writeup, Manish Goregaokar’s release-day post, or my high-level overview from last December.

The Team

Stylo shipped, first and foremost, thanks to the dedication and passion of the people who worked on it. They tackled challenge after challenge, pushing themselves to the limit and learning whatever new skills or roles were required to move things forward. The core team of staff and volunteers spanned more than ten countries, and worked (quite literally) around the clock for over a year to get it done on time.

But the real team was also much larger than the set of people working on it full-time. Stylo needed the expertise of a lot of different groups with different goals. We had to ask for a lot of help, and we rarely needed to ask twice. The entire Mozilla community (including the Rust community) deeply wanted us to succeed, so much so that almost everyone was willing to drop what they were doing to get us unblocked. I originally kept a list of people to thank, but I gave up when it got too big, and when I realized the countless ways in which so many Mozillians helped us in some way, big or small.

So thank you, Mozilla community. Stylo is a testament to your hard work, your ingenuity, and your good-natured, scrappy grit. Be proud of this release - it’s a game-changer for the open web, and you made it happen.

On the Right Fix, and Why the Bugzilla Breach Made Me Proud

Fri, 19 Feb 2016 00:00:00 +0000

At Mozilla, we keep security-sensitive bug reports confidential until the information in them is no longer dangerous. This week we’re opening to the public a group of security bugs that document a major engineering effort to remove the rocket science of writing secure browser code and make Firefox’s front-end, DOM APIs, and add-on ecosystem secure by default. It removed a whole class of security bugs in Firefox - and helped mitigate the impact of a bug-tracker breach last summer.

The Easy Fix and the Right Fix

Software bugs generally have an easy fix and a right fix, which are often not the same. The easy fix is faster to ship, but inspires less confidence that the problem is gone for good. Too many easy fixes make for brittle and unmaintainable code, but too many right fixes mean you never ship. This tradeoff is at the heart of software engineering.

For security-sensitive software bugs, the tradeoff looks different: there is a global community of adversaries trying very hard to trigger incorrect behavior. If it can break, it will break, and the cost to users can be enormous. Spot-fixing unsafe behavior at a single call-site can often do more harm than good, because it invites an army of watchful eyes to scour the codebase for similar instances of the same pattern. They don’t need to find all of them, just one.

Sometimes the right fix is out of reach. For example, completely eliminating memory hazards in a browser engine would require rewriting everything in a memory-safe language like Rust, which isn’t going to happen overnight. And so, to limp along, browser vendors have lots of engineers and lots of machines fuzzing day and night to try to find memory hazards before the black hats do. Discovering them this way is far preferable to discovering them as zero-days, but it’s still an enormously expensive way to tackle one particular class of bugs, and can only catch bugs after they’ve landed in the tree.

This is all to say that it is worth quite a lot of effort to find and implement the right fix to security bugs. And that’s exactly what we did with Slaughterhouse.

The Slaughterhouse Story

Late in 2013, I became concerned about a series of reports from a handful of external security researchers. These were all technically bugs in the parts of Firefox that are implemented in JavaScript - the front-end, add-ons, and various DOM APIs - and it was tempting to hand them to the appropriate developers to fix, along with a scolding for doing things wrong. But the worrisome thing was that the security issues at play - mostly variants of the Confused Deputy Problem - were extremely subtle. Avoiding them required a degree of expertise and paranoia that couldn’t realistically be assumed for every single person hacking on the JS inside Firefox, to say nothing of add-on authors whose code we don’t control.

The right fix was to make the platform safe by default. This was, unfortunately, easier said than done. There were gobs and gobs of pre-existing code running on the platform, and so we couldn’t simply disable the ever-widening set of interaction patterns we had determined to be questionable. Instead, we had to figure out a way to make it safe while preserving the semantics that people were depending on.

To make this work, we needed to make large, cross-cutting changes to a lot of core pieces of the platform - The JavaScript engine, the DOM, Xray Wrappers, Chrome Object Wrappers (COWs), WebIDL, Jetpack, Greasemonkey, and quite a bit more. The name “Slaughterhouse” was a tongue-in-cheek reference to “killing all the COWs,” along with a nod to the overall gravity of the situation.

We launched Slaughterhouse on October 22nd 2013, and declared victory exactly one year later. Over that year, we landed hundreds of patches across numerous components and products, coordinating a lot of careful and creative engineering work that I’m really proud of. The technical details are too much to delve into here, but they are quite cool: the final result is a very advanced Object Capability system that uses a membrane layer of ES6 Proxies to achieve strong security guarantees while preserving compatibility with existing code. For the curious, more details can be found in a talk I gave on the subject in Portland, the documentation of our script security architecture and Xray Wrappers, and finally, for the detail-oriented, the complete dependency tree in Bugzilla.

Avoiding the Worst of the Breach

This past August, Mozilla discovered that an attacker had compromised a privileged Bugzilla account and used it to obtain access to a number of security-sensitive issue reports. The attacker who breached Bugzilla was particularly interested in the kind of Confused Deputy bugs that Slaughterhouse was designed to eliminate, and searched Bugzilla for them specifically. When the attacker gained access to Bugzilla, we were just landing the final pieces of Slaughterhouse on Nightly, and had already shipped large parts of it to the Release audience of Firefox. Of the high-severity bugs that the attacker found, more than 80% had been fixed on Release by the time of access, and most of the others had fixes in the pipe on pre-Release channels. Moreover, any proof-of-concept exploits that the attacker found in Bugzilla would likely have been dead-ends, since we fixed the problems at the source in a way that would also fix any similar-but-different vulnerabilities lurking nearby.

When we learned of the breach, I waited with bated breath to hear how many of the high-severity bugs accessed by the attacker we had yet to fix in our Release channel. A few days later, the lead on the investigation pulled me aside with a grin:

“Two f***ing bugs!”

MozPromise: A Better Tool for Asynchronous C++

Tue, 18 Aug 2015 00:00:00 +0000

Last time, I argued that shared mutable state should be considered harmful. To avoid it, threads need to own their data and communicate asynchronously.

The traditional approach to asynchronous programming involves a lot of callbacks: every potentially-asynchronous operation needs to bundle continuation logic as a callback function which gets invoked when the operation completes. This is great in theory, but in practice has a lot of downsides:

The control flow is scattered across the callbacks, making the program logic harder to follow. Worse, this difficulty scales linearly with the number of asynchronous operations, creating a perpetual disincentive to make more things asynchronous.
The control flow can end mysteriously when an API forgets to invoke the callback. Debugging these cases is much more painful than debugging a synchronous hang, because there’s no backtrace.
There tends to be a lot of boilerplate to marshall control flow in and out of messages, and this boilerplate can be easy to get wrong.

These problems span languages and runtimes, but are particularly visible in the Web Platform, whose run-to-completion model forces most interesting APIs to be asynchronous. After a lot of experimentation with different ways to improve the situation, Web developers recently coalesced around Promises as the prevailing idiom for managing asynchronicity. And while Promises certainly have their detractors, most people in the JavaScript community seem to agree that they were a good idea. So why don’t we have something like that in C++?

The answer is complicated. Recent editions of the C++ standard library include a thing called a Promise, but it doesn’t look much like a JavaScript Promise, in large part because C++ itself has no concept of an Event Loop. Gecko has one though, so last November I started building some machinery to mimic the idioms of JavaScript promises in our C++ multimedia stack. This became MediaPromise, was later renamed to MozPromise, and finally moved from dom/media/ to xpcom/.

MozPromises work great, and I already forget how we lived without them. A number of other organizations and researchers have been barking up the same tree this year, which indicates that we’ll likely see a lot more of this kind of thing very soon. The MozPromise API was a quick-and-dirty job, and I expect the industry will iterate on these patterns and eventually produce something more elegant and general. That being said, the core ideas of MozPromise have proven themselves to be exceedingly useful in enabling asynchronicity and parallelism, and I think they’re worth sharing.

The Basics

A method whose result may be computed asynchronously returns a MozPromise. More specifically, it returns an nsRefPtr<MozPromise<ResolveType, RejectType>>, which is templated on the type of values that we want to propagate upon success or failure. Returning a smart pointer directly from a method is generally frowned upon, but we do it anyway. This enables us to compactly follow the MozPromise-returning method with a Then() call, similar to what we’d do in JavaScript:

mProducer->RequestFoo()
         ->Then(mThread, __func__, this,
                &ThisClass::OnFooResolved,
                &ThisClass::OnFooRejected);

Then() takes a strong reference to a callback object (this in the case above), and guarantees that either OnFooResolved(ResolveType) or OnFooRejected(RejectType) will eventually be invoked as an asynchronous event on mThread. We pass __func__ here and elsewhere so that the built-in logging can print out the entire history of control flow (this turns out to be quite useful).

The above already offers several useful advantages over traditional callback-based asynchronicity:

The exact callback method is selected by the caller at call time, and not exposed to the underlying API at all. This eliminates boilerplate interfaces, enums, and all the other junk that’s normally needed to hook up and invoke a callback across loosely-coupled modules.
The callback is guaranteed to be invoked asynchronously on the thread that the caller intended, without any additional work on the part of the callee. This eliminates common re-entrancy and thread-safety pitfalls.
Hangs are easy to diagnose, because we have an object (the MozPromise) which tracks the request we made. The logging facilities make it simple to locate the caller that allocated the MozPromise, even when it’s buried deeply in unfamiliar code.
Mandatory first-class error handling.

Even with those advantages, the above code still requires the reader to jump to a different line to follow the control flow. This is often fine, but gets cumbersome when OnFooResolved is one or two lines of code. Fortunately, C++ lambdas allow us to support an alternate overload of Then() with inline callbacks:

mProducer->RequestFoo()
         ->Then(mThread, __func__,
                [...] (ResolveType aVal) { ... },
                [...] (RejectType aVal) { ... });

Note that the resolve/reject values are always optional, so they may be omitted from the callback signature when unnecessary.

MozPromises are also chainable, provided that the types match up:

mProducer->RequestFoo()
         ->Then(mThread, __func__,
                [...] (ResolveType aVal) { ... },
                [...] (RejectType aVal) { ... })
         ->CompletionPromise()
         ->Then(mOtherThread, __func__,
                [...] (ResolveType aVal) { ... },
                [...] (RejectType aVal) { ... });

This works much like you’d expect from JavaScript: the first callback can itself return a MozPromise, to which the second Then() is indirectly applied. Unlike JavaScript though, callers must explicitly invoke CompletionPromise() to access the thenable. This allows us to optimize some things in the common case where it isn’t needed. More importantly, it permits us to do something more interesting with the direct return value of Then() - more on that in the section on disconnection below.

InvokeAsync

The core difficulty of owned-data multi-threading comes when logic on thread A needs to interact with data owned by thread B. This is where MozPromise really shines, with the aid of a small helper called InvokeAsync:

InvokeAsync(mOtherThread, [MozPromise-Returning Method])->Then(...)

InvokeAsync dispatches a runnable to an arbitrary thread B to invoke a method that returns a MozPromise. Next, it returns a separate MozPromise of the same type to its caller on thread A. When the target method executes, the returned MozPromise is chained to the one that was returned to the caller on thread A, such that resolution and rejection are propagated through.

This allows for safe and transparent cross-thread procedure calls: thread A can invoke a method on thread B and use the result directly, whether or not A == B. In other words, MozPromises hide the details of message-passing and offer uniform, programmer-friendly ergonomics for same-thread and cross-thread procedure calls. This dramatically reduces the cost of dividing program logic across multiple threads, and puts the fruits of parallelism much more within reach.

Disconnection

MozPromise callbacks are cancellable up until the moment they are invoked, which is very powerful when you want to interrupt the operation of a long, asynchronous pipeline.

Consider the case of seeking a media element in Gecko. The request originates from a user action (manipulating the video controls), which invokes a setter on HTMLMediaElement on the Main Thread. This forwards request to the Decoder State Machine Thread, which queues up work on the Media Decode Thread, which is usually delegated to an OS-specific Platform Decoder Thread.

This presents a serious problem when the user interrupts the action and tries to seek somewhere else before the original seek has completed. There’s a lot of inertia in the pipeline, and we have no way to stop it instantaneously. The first operation may have already completed, and the result might already be on its way back, so we risk getting confused if we initiate a second operation.

So we ended up writing code like this:

mQueuedSeekTarget = newSeekTarget;
mReader->DispatchCancelSeekTask();
// Wait for the previous seek to succeed or be canceled.
// When everything is finished, we'll check mQueuedSeekTarget
// and start seeking again.
return;

This is clearly suboptimal, and we can do better with MozPromises.

I mentioned earlier that Then() does not return another MozPromise. This is because it returns a MozPromise::Request, which is a handle to the callback invocation scheduled by Then(). Callers that care can store this value, and invoke Disconnect() if they no longer wish to receive the callback. This allows us to handle interrupt seeks much more elegantly:

mSeekRequest.DisconnectIfExists();
mReader->DispatchCancelSeekTask();
// Move on with life \o/
mSeekRequest.Begin(
  InvokeAsync(mReader, &Reader::Seek)
  ->Then(...)
);

We use disconnection heavily in Gecko’s media stack, and it is exceedingly useful in handling interruptions, error conditions, and shutdown.

Conclusion

MozPromises solved a lot of tricky problems for the Media Playback Team, and I’d encourage other Gecko hackers to give them a spin. I also welcome thoughts and feedback from the wider community of developers and theorists - I’m sure there are heaps of improvements to be made, and I would be especially interested in concrete suggestions that are practical to implement.

The latest version of the code can be found here, and a snapshot of the code at the time this piece was written can be found here. Patches welcome!

Must be This Tall to Write Multi-Threaded Code

Fri, 17 Jul 2015 00:00:00 +0000

David Baron put this up in Mozilla’s San Francisco office a while back:

This is cute way of saying that writing safe concurrent code is, at present, rocket science. This is unfortunate, because the future of computing is shaping up to be all about concurrency. Fundamental engineering constraints like power usage are steering microprocessor manufacturers away from single-core architectures. If the fastest chips have N cores, a mostly-single-threaded program can only harness (1/N)th of the available resources. As N grows (and it is growing), parallel programs will win.

We want to win, but we’ll never get there if it’s rocket science (despite the industry-leading density of rocket scientists hacking on Gecko). Buggy multi-threaded code creates race conditions, which are the most dangerous and time-consuming class of bugs in software. And while we have some new superpowers to help us react when they inevitably occur, debugging racey code is still incredibly costly. To succeed, we need to prevent races from happening in the first place.

Why do races occur? Opinions differ, but I argue the following:

Races are endemic to most large software projects because the traditional synchronization primitives are inadequate for large-scale systems.

Let me explain.

Small-scale systems are easy to build and maintain. So long as the details can all fit in the heads of a small number of programmers, it’s relatively easy to shuffle things around to meet requirements and verify that all the pieces interact properly.

Large-scale systems are a different story. Many cooks in many interdependent kitchens necessitate strong, assertable rules that allow programmers to reason about the unknowable. These rules provide a baseline level of order, but to be truly useful, they need to be predictable: different programmers should be able to invent similar or identical rules by deriving them from a small set of core principles, such that everyone can make reasonable predictions about the high-level behavior of code they haven’t read.

Software engineering at this level is an art, whose core mission is to find the right abstraction - one that naturally offers guidance and solutions for the problems that need to be solved (especially the ones that don’t exist yet). The wrong abstraction is painful and error-prone. The right one is a never ending stream of goodness from which all answers flow.

Locks don’t lend themselves to these sorts of elegant principles. The programmer needs to scope the lock just right so as to protect the data from races, while simultaneously avoiding (a) the deadlocks that arise from overlapping locks and (b) the erasure of parallelism that arise from megalocks. The resulting invariants end up being documented in comments:

// These variables are protected by monitor X:
...

// These variables are only accessed on thread Y:
...

And so on. When that code is undergoing frequent changes by multiple people, the chances of it being correct and the comments being up to date are slim.

There’s a familiar story that has repeated itself many times throughout Gecko’s history:

Engineering leadership sees benefits to accessing some component on multiple threads, and kicks off an effort to make it thread-safe.
The component becomes incredibly complex and difficult to maintain. Quality suffers, engineers avoid touching it, and improvements slow to a trickle.
The component is now plagued with problems. The owner comes up with an elegant new design that solves most of the problems, but needs to forbid multi-threaded access to make it work. This is deemed a good trade-off by everyone, and the component becomes non-thread-safe once again.

At first glance, this constant retreat from thread-safety by seasoned programmers looks pretty grim for a multi-core future. However, these programmers aren’t fleeing concurrency itself - they’re fleeing concurrent access to the same data. That is to say, safe and scalable parallelism is achievable by minimizing or eliminating concurrent access to shared mutable state.

In this approach, threads own their data, and communicate with message-passing. This is easier said than done, because the language constructs, primitives, and design patterns for building system software this way are still in their infancy. Rust is designed from the ground up to facilitate this, and uses its type system to enforce single ownership of data. We’re already using some Rust in Gecko, but we’re not going to be rid of C++ anytime soon. So it’s critical to explore ways to incrementally add safe concurrency in C++.

During the first half of this year, I did a tour of duty with the Multimedia Playback Team to help rebuild the heavily-threaded decoding and playback pipeline to be less racey. To solve the problems I encountered there, I built some new tools and primitives that ended up being game-changers in our ability to easy write easy and correct concurrent code.

Back on the Web

Sun, 12 Jul 2015 00:00:00 +0000

I’ve had a lot of websites over the years. The first one went live when I was eleven years old, and was admittedly much more about style than substance. That one was on Angelfire, the next on Tripod, then a Xanga, then self-hosted MovableType and WordPress. You can probably find them all of them archived somewhere if you try, but I won’t disturb their slumber by linking to them here.

I didn’t have much to say back then, but I sure went to great lengths to say it anyway. I have more to say now, but much more trouble finding the time to write it down. I racked up a grand total of nine posts over seven years on my most recent blog with decreasing frequency. The more things I’m working on, the less often blogging bubbles up to the top of my list.

I’d like to change that. There’s a backlog of stuff I’d like to write about, and I’m hoping that I can overcome the inertia that has foiled me in the past by keeping posts short and specific.

Here goes.