Prioritizing Memory Safety Migrations

Update 11 April: Please also see Long Live Sandboxing!. Sandboxing is not dead, despite what you might have heard.

With all the talk of using Rust to reduce memory unsafety bugs, such as Android using Rust in the Android Open Source Project, there’s a lot of extremely reasonable concern about the high cost of “rewriting it all in Rust” (or any other safer language), as it’s often phrased. Operating systems, web browsers, complex online services, and so on can be implemented with tens of millions of lines of C/C++ code. (Sometimes more.) Rewriting all that seems prohibitively expensive, and exacerbates what Alex Gaynor aptly calls grief — people stay in the denial stage longer when struck by the enormity of the memory unsafety problem.

Thankfully, replacing C/C++ with code in a safer language is not an all-or-nothing task. We can do it gradually; some parts we might never need to replace. Most safer languages can link in the same address space as C and/or C++, and call into and be called by C/C++. You can also normalize data structures such that the safe code handles arbitrary inputs, and the C/C++ code can focus on a single, simpler grammar. For example:

internet → { PNG,
JPEG, GIF, TIFF, ... } → SkBitmap
You can accept arbitrary image (e.g.) formats from the internet, use a safer language to normalize them into Skia’s simple SkBitmap format, and then handle the bitmaps in Skia in C++. This simplifies the C++ code (reducing its attack surface), and provides a simple cross-language interface.

But how do you tell where to start replacing C/C++ with safer code, and where to stop?

Although security is certainly not the only benefit of a safe language — the Android team’s post starts out stressing correctness — my perspective is security. And from that starting point, we can use what I think is a pretty clear method to prioritize our efforts.

Even if you have, say, 20 million lines of C++ code, not all of it is directly or indirectly exposed to attackers. You can start hardening the most exposed code first, and you can rank exposure by how long the path to the code is. Consider Ian Beer’s epic radio pyrotechnics, in which he compromised iPhones by sending them mean-spirited packets by radio. We can model the attack surface exposure something like this:

attacker → radio
chip → kernel
An attack pathway from the internet to the kernel.

That’s a bit of an oversimplification, but it lets us see that the attacker’s call graph is not very deep — that is, that driver is pretty exposed.

Additionally, as the title of Ian’s post points out, the attacker’s cost to traverse the first few edges is 0. We can model that by assigning ‘weight’ or ‘cost’ to the edges in the graph. The higher the cost, the less likely it is that the attacker will succeed. Assuming the radio is fairly simple and does little normalization or filtering before passing what it got to the kernel, we might draw something like this:

attacker → (0)
radio chip → (low or medium) kernel
An attack pathway from the internet to the kernel, with estimated costs for each edge traversal.

On a scale from 0 → low → medium → high, we might generously estimate the cost to exploit the vulnerable driver to be maybe as high as medium. If the defender is lucky, maybe ASLR is working, or something.

Ian explains everything in full detail in his post, but in general we should not think of C/C++ code as defensible. If an attacker is able to get at C/C++ attack surface, we must assume they can win.

As an additional example, consider your web server’s or browser’s TLS implementation. Should we consider it exposed? We can model it something like this:

attacker → (0,
low, or medium) net interface → (passthru) kernel → (passthru) TCP → (low or
medium) TLS
An attack pathway from the internet to the application’s TLS implementation.

In this case, the attacker is interested only in the application’s TLS implementation, and is just using the kernel as a way to get there — they are not attacking the device driver or the TCP implementation. (Although those are also exposed attack surfaces, of course.) The kernel typically does not do anything with the application layer traffic, passing it verbatim to the userland application. So the kernel is not creating a security boundary in this case.

The attacker has a pretty straight shot to your application’s TLS implementation; the only attack precondition is whether the attacker can send malicious TLS traffic to the application. Obviously, servers listen to the internet and process whatever they get; that’s 0 cost. If attacking a client, an attacker may have to get the target to contact their server or may have to be on the same network as the target. We might say that is up to medium cost.

So, things like device drivers and HTTP, TCP, and TLS implementations are all fine candidates for (re)implementing in a safer language. They’re just unavoidably exposed.

Consider an example where the C/C++ attack surface is not as directly exposed.

attacker → ... →
HTTP → parse CSP → evaluate CSP
An attack pathway from the internet to a client’s CSP evaluator.

In this example, we have an HTTP client that is going to parse and evaluate a Content Security Policy (CSP) header. Each of the network interface, device driver, TCP implementation, TLS client implementation, HTTP client implementation, and CSP parser are fairly exposed attack surface. For example, if the attacker wants to exploit some bug in the CSP parser, they can likely rely on all of the previous components to pass the header value through verbatim to the CSP parser. Thus, they probably do not create a security boundary.

But if the attacker wants to exploit a likely bug class, mis-evaluation of CSP policy, they must first get past the CSP parser. Although it is vulnerable attack surface, it does also provide some security boundary: the policy must be well-formed according to the grammar the parser accepts. Another bug class is that the parser’s grammar is not necessarily the same as the grammar in the spec.

Thus, we’d be speaking of logic bugs in the CSP parser and/or evaluator. This is the kind of code that can be buggy in any language; this is not memory unsafety that can be resolved at scale with a safer language.

These examples suggest that you have to get fairly deep into the call graph before memory unsafety becomes less of a concern. That’s consistent with the findings that memory unsafety accounts for anywhere from ⅔ to ¾ of vulnerabilities. The problem is that bad.

Models like those above can be step 1 in a process of repair triage. You might order a set of constraints when filtering through what code to rewrite, apply mitigations or testing to, or even get rid of first:

  1. Select the most exposed code
  2. ...of that code, start with the highest-privilege code
  3. ...of that code, start with the code that has the highest observed bug count

Or you might triage differently, depending on your situation:

  1. Select the most exposed code
  2. ...of that code, start with the code that has the highest observed bug count
  3. ...of that code, start with the highest-privilege code

Or even:

  1. Select the code that has the highest observed bug count
  2. ...of that code, start with the most exposed code
  3. ...of that code, start with the highest-privilege code

Which approach is appropriate depends on your system. For example, if you are working entirely in the kernel, all your code runs at the same level of privilege so you can’t use that as a filter. Or if you are in userland, but your application is not making use of process sandboxing, consider exploring that first before starting a rewrite effort.

In any case, we don’t have to “rewrite everything in Rust” to improve memory unsafety, and we are not lost in a sea of undifferentiated attack surface. There are ways we can prioritize in a somewhat systematic way — we don’t have to fix random things ad hoc.

Thanks to Jacob Hoffman-Andrews, Andrew Dunham, and Dev Akhawe for reading drafts of this post and suggesting helpful improvements!