As too few software engineers know, about 65% of known vulnerabilities in C/C++ codebases are due to memory unsafety. The “65% result”, as I’ll call it, is consistent across vendors and across decades. Obviously, that conclusion comes from data biased in a certain way: these are the vulnerabilities that attackers and defenders have been consistently able to find; these are the vulnerabilities that vendors are willing to disclose at all; these are the vulnerabilities that we choose to talk about.
We know there’s more going on out there. Even so, the result is useful and actionable.
There are also several efforts to track known vulnerabilities that we know have actually been exploited in the wild (ITW):
(See also David Cook’s ICS-CERT Advisories Scraper, which is specific to ICS and covers more than just ITW bugs.)
These datasets are also necessarily biased: these are (some of) the vulnerabilities attackers can find, and (some of) what they can actually field. But we also know that phishing and other social-engineering techniques account for a huge portion of real-world attacks.
These different biases are useful: the more information we have about what’s possible and what attackers are really doing, the better we can respond — as long as we seek out a variety of datasets and remain aware of their biases. I’d love to see additional datasets with entirely different foci, e.g. credential and permission phishing, fake invoice fraud, USB drives with malware left sitting around... We might not be able to get data about side-channel attacks fielded ITW, but I can dream.
I was curious to see if the 65% result aligned with what we see in CISA’s data. I imported their CSV into a Google sheet, and started categorizing the CVEs according to a sketch of a taxonomy I came up with for this purpose, and by the implementation language of the target. (See the type and language columns.) To substantiate the classifications where the description was not obvious, I also added a column with details on the bug and/or its exploitation. This column usually contains a link to a proof of concept (PoC), an analysis, or other details.
Additionally, Jack Cable mapped the CVEs to their CWEs (see the new columns cwe and cwe2), and mapped those to the taxonomy described here. That analysis shows substantially different results, which is interesting. This suggests that CWEs (or any data in the CVE entry) don’t contain enough information to tell whether a vulnerability is memory related. For instance, CWE-20, improper input validation, may or may not result in a memory-unsafety vulnerability.
It’s not complete: we haven’t finished categorizing all the bugs for type (currently 60% done) nor for language (currently 58% done).
The way in which it’s incomplete is not random: I did a bunch of easy ones first, I did some searches for particular keywords and categorized those first, and so on. Therefore the percentages calculated at the top are not necessarily what we’ll see once the categorization is complete.
My classifications might be wrong! It’d be nice to have more people go through and see if they agree with how I’ve categorized things — some bugs might have the wrong type or language. If you can correct an error, fill in an unknown, or add more detail, please add a comment on a cell. Thanks! (See also the “Unknown” filter view.)
There’s more to be done. With some somewhat hairy spreadsheet code, you can find out some fun facts about the distribution of the bugs. For example, in cell F2, I’ve calculated the percentage of C/C++ bugs that are memory unsafety (currently 55.18%):
=round( multiply( 100, divide( countifs( B10:B1000, "=memory", A10:A1000, "=C/C++"), countif(A10:A1000, "=C/C++") ) ), 2)
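If you’d rather work from an exported CSV than inside the sheet, the same calculation can be sketched in Python. This is a minimal sketch, not what the sheet itself runs: it assumes the export has header columns named language and type (matching the sheet’s columns), and the sample rows are made up for illustration.

```python
import csv
import io


def memory_share(rows, language="C/C++", bug_type="memory"):
    """Return the percentage of `language` rows whose type is `bug_type`."""
    in_language = [r for r in rows if r["language"] == language]
    if not in_language:
        return None  # nothing to divide by; the spreadsheet formula would error here
    matching = sum(1 for r in in_language if r["type"] == bug_type)
    return round(100 * matching / len(in_language), 2)


# Made-up sample rows for illustration only; NOT data from the real sheet.
SAMPLE_CSV = """language,type
C/C++,memory
C/C++,logic
C/C++,memory
Java,eval
PHP,eval
C/C++,configuration
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
print(memory_share(rows))  # 2 of the 4 C/C++ sample rows are memory: 50.0
```

As with the formula, the result only describes the rows categorized so far.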
Since the time I started this lil project, CISA has added many more rows to their spreadsheet. And I have not done the same analysis with other datasets like P0’s and Ritter’s.
Here is how I categorize the vulnerabilities in CISA’s dataset:
|Type|Sarcastic Name|Description|Examples (non-exhaustive)|
|---|---|---|---|
|memory|“C problems”|Spatial or temporal memory unsafety|Buffer overflow, use-after-free, write-what-where, double-free, leak or use of uninitialized memory|
|eval|“Lisp problems”|Treating attacker data as interpreted code|SQL injection, XSS, shell injection, deserializing evil objects and loading their evil classes|
|logic|“Brain problems”|Errors in application-layer logic|Incorrect branch condition, incomplete information in branch condition, type confusion, integer semantic sadness that does not result in memory unsafety|
|configuration|“Face-palm problems”|Errors in default or likely deployment configuration, misfeatures|Leaving the debug interface on in production, web shell as a ‘feature’, default passwords|
|cryptography|“Math problems”|Errors in the use of cryptography, including not using it|N-once reuse, low-entropy keys, confidentiality where integrity is needed (or vice-versa, or both), plaintext|
|ux|“Human problems”|Problems that arise when the UI, UX, or social context does not match human needs, limitations, or expectations|Phishable credentials, affordances favoring errors, confusing UI or documentation, high effort/concentration required, UI redressing|
It’s difficult to create a universally-applicable taxonomy. (Ask any biologist.) You can see everything as a logic bug, or you can see C’s problems as being user experience bugs for developers (DX): affordances that favor errors, too hard to use consistently safely, and counter-intuitive semantics.[1]
My categories are intentionally broad, for 2 reasons.
As a defender, I typically classify bugs by asking what went wrong in the design or implementation, and how are we going to fix it. What would the fix look like, and can it be systematic or (semi-)automated? Checking bounds, correcting object lifetimes, fixing an else condition, fixing the deployment configuration? Un-shipping a misfeature? UX research?
Now, sometimes we might exploit e.g. memory unsafety to achieve type confusion, or vice versa, or use e.g. buffer overflow to achieve command injection. I categorized these bugs by what I see as the first error in a possible chain of errors. (Although I won’t say “root cause”, of course.)
In some cases, memory safety would have stopped exploitation, even if memory unsafety is not the first error in the chain. I typically classified those as logic bugs.
Notably, I am not classifying bugs by their outcomes during exploitation, e.g. information disclosure, remote code execution (RCE), local privilege escalation (LPE), denial of service (DoS), etc.: the same bug may have many possible outcomes. Nor do I classify by severity: everyone has a different threat model, so a standardized severity system is typically hard to apply meaningfully.
To see these patterns for yourself, it helps to make heavy use of the Filter View feature of Sheets. You can also make a copy and add in your own =COUNTIFs and so on. I bet there are patterns I missed! Please add comments to the sheet or email me if you see something interesting.
Path traversal accounts for a large chunk of vulnerabilities (which I categorize as logic). As with URLs, text strings are an alluring but ultimately not consistently workable interface for describing paths from root to branch in a tree. People just can’t decode, resolve, or compare them consistently, and those are security-critical operations.
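To make the “resolve, then compare” point concrete, here is a minimal Python sketch of one way to confine an untrusted path to a base directory. The function name resolve_under and the example paths are hypothetical, and the sketch is not taken from any bug in the dataset. The idea is to compare resolved path components rather than raw strings.

```python
from pathlib import Path


def resolve_under(root, untrusted):
    """Return the resolved path for `untrusted` if it stays inside `root`, else None."""
    root_path = Path(root).resolve()
    # Joining with an absolute `untrusted` discards `root_path`; the containment
    # check below rejects that case as well.
    candidate = (root_path / untrusted).resolve()
    # Compare resolved components, not string prefixes: a prefix check would
    # accept "/srv/files-evil/x" for the root "/srv/files".
    if candidate == root_path or root_path in candidate.parents:
        return candidate
    return None


print(resolve_under("/srv/files", "reports/q1.txt"))
print(resolve_under("/srv/files", "../../etc/passwd"))  # None
```

Even this is not the whole story: a symlink can change between the check and the later open, which is part of why strings make such a slippery interface for paths.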
There’s lots of ‘remote shell as a feature’ going on (which I classify as configuration). Debug interface? Quick-and-easy way to implement some functionality? Lack of proper library APIs for some functionality? All of the above, I’d imagine.
CISA’s data does not count UX bugs that make phishing (of various types) and misuse/misconfiguration more likely or easier to attack — but we know they are a big part of exploitation in real life. I suspect the ux category is vastly underrepresented. If we counted them, ux might be greater than all the other categories combined.
The goat in the room is credential phishing. This fatal problem will remain rampant until we build support for WebAuthn into all important services.
Unsurprisingly, the memory category is the biggest single category (so far), although it’s not fully 65% of the bugs used against C/C++ targets.
Keep in mind that the 65% finding is for codebases that are in C/C++, but this dataset describes systems implemented in a variety of languages — and most languages are memory-safe. Memory unsafety exploitation may be over-represented as an attack type in the dataset; i.e. perhaps attackers in the wild are favoring it because of the control such bugs provide, their skill sets, stealth, or similar kinds of reasons.
I’d point to eval as the true second most immediately actionable category for fixing/exploiting (depending on your proclivities). There’s so much easy-to-find stuff in that category, with a variety of techniques for discovery.
The logic category is hugely broad — almost a default — so its prominence as second-biggest category might not be as meaningful or actionable. (Although you will see patterns in that category.) It represents a long tail of scattered bug classes and (hopefully) one-offs.
We need to have a blameless postmortem for a way of documenting vulnerabilities that is already dead.
These are vulnerabilities that affect people’s lives, government policy, the economy, civil society — all the bugs in question have been exploited ITW — yet there’s so much noise, obscurantism, and bravado that it’s often more difficult, not less, for people to decide what to do.
We need to stop writing, and accepting, vague write-ups like “execute code via unspecified vectors”, “allows remote attackers to cause a denial of service or possibly execute arbitrary code”, “the vulnerability is due to insufficient handling of [data]”, and so on. (These are real examples!)
A big part of the purpose — or, potential — for public vulnerability announcements and reports is to teach and learn, mature the engineering culture, and above all to avoid repeating these problems. And for that, we need specifics and we need sufficient certainty. Being vague is not the most effective way to compensate for risk.
The people building the infrastructure of our world, and bodies like the Cyber Safety Review Board, are most effective when they have all the facts at hand. Aviation safety has made huge game-changing improvements over the decades, but not without full access to the (sometimes embarrassing) details. We need Feynman explaining the Challenger explosion, not handwaving. The links I’ve added in the additionalDetail column are, overall, much more like the Feynman-grade stuff we need to get a real grip on what’s going on.
Working on this classification project required us to hunt for additional details when the official descriptions were lacking. In about 9 hours of work, I was able to get through about 45% of the 616 bugs. If the official descriptions had had enough content, I could likely have finished 100% in much less time.
Hunting for bug detail led me to the unfortunate conclusion that a CVE number is little more than a search keyword. You always have to go to hacker blogs, bug trackers, and find and read PoCs. Very occasionally, the vendor’s announcement would have more detail than the CVE entry.
Pro Tip: Don’t start by just searching for the CVE number. The top 10 hits are going to be just sites that copy the CVE entry. (I will file this as a Search quality bug when I get to work on Monday.) Instead, you have to be more precise (the quotes help):
"cve-abcd-efgh" "project zero"
Sometimes you can get some detail by searching Twitter, too.
Ultimately, all software bugs are logic errors — software is logic. But what I’m looking for are systematic ways to correct errors, and the bug classifications reflect that. As defenders, we shouldn’t fix individual buffer overflows; we must stop using C. We shouldn’t fix SQL injections; we must use parameterized queries. We shouldn’t fix shell injections; we must stop using popen, and instead build and use real APIs. We shouldn’t fix instances of XSS; we must use a structured templating system. And so on.
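As one illustration of a systematic fix, here is a small sketch of the parameterized-query point using Python’s built-in sqlite3 module. The table, rows, and function names are invented for the example. The unsafe version splices attacker data into the query text; the parameterized version only ever binds it as a value.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, role TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [("alice", "admin"), ("bob", "user")],
)


def find_by_name_unsafe(conn, name):
    # Anti-pattern: `name` becomes part of the SQL text, so it is parsed as code.
    query = f"SELECT name, role FROM users WHERE name = '{name}'"
    return conn.execute(query).fetchall()


def find_by_name(conn, name):
    # Parameterized: the SQL text is fixed; `name` is bound as data only.
    return conn.execute(
        "SELECT name, role FROM users WHERE name = ?", (name,)
    ).fetchall()


payload = "nobody' OR '1'='1"
print(find_by_name_unsafe(conn, payload))  # matches every row
print(find_by_name(conn, payload))         # matches no rows: []
```

The fix is not a patch to one query but a rule: no query text is ever built from data. That rule is mechanical enough to audit for.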
Almost all of the exploited vulnerabilities are quite mundane, and solvable by mundane means. They’re not sexy or weird or surprising — and that’s good news. So much pain and trouble can be solved with simple tools that we already have.
We need to get increasingly clear about implementation quality requirements. This includes stopping new uses of known-unsafe constructs from getting into our codebases (with presubmit scripts or Git hooks), systematically auditing for existing uses, and treating them as bugs and technical debt to prioritize paying off. Often, you can simply grep or weggli for these, and get a list. There’s also CodeQL.
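A presubmit check of this kind can be very small. The sketch below is a crude, grep-style scan in Python; the ban list and function names are hypothetical, and a real list would reflect your own codebase’s rules. It matches text rather than syntax, so it will also flag calls mentioned in comments and strings.

```python
import re

# Hypothetical ban list, for illustration only.
BANNED_CALLS = ("popen", "system", "strcpy", "sprintf", "gets")
PATTERN = re.compile(r"\b(" + "|".join(BANNED_CALLS) + r")\s*\(")


def find_banned_calls(text):
    """Return (line_number, call_name) pairs for each banned call in `text`."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        for match in PATTERN.finditer(line):
            hits.append((number, match.group(1)))
    return hits


def check_files(paths):
    """Print each finding and return 1 if anything was found, else 0."""
    failed = False
    for path in paths:
        with open(path, encoding="utf-8", errors="replace") as source:
            for number, name in find_banned_calls(source.read()):
                print(f"{path}:{number}: banned call {name}()")
                failed = True
    return 1 if failed else 0
```

A Git hook or presubmit step could pass the changed files to check_files and use the return value as its exit status. For anything beyond a first pass, syntax-aware tools like weggli or CodeQL are the better fit.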
Our goal as software engineers should be to eventually get down to only bugs that are one-offs, specific to the application.
Thanks to Dev Akhawe for helping me categorize the bugs, and thanks to Jonathan Rudenberg and Eric Rescorla for reading early drafts and proposing improvements. Jack Cable mapped the CWEs to the taxonomy I use here. Any errors, and there are surely many, obviously remain my own.
Any sufficiently complicated C or Fortran program contains an ad hoc, informally-specified, bug-ridden, slow implementation of half of Common Lisp. This includes
After all, what is %n but a hard-to-use form of eval? 🤔 🤨