On Validating Inputs

Introduction

A principle of secure software development is that only the callee (sometimes called the relying party, RP) should trust itself to validate an input. Typically, the callee has a higher degree of privilege than the caller, and will use its privilege to perform some task for the caller if and only if the input is valid.

Allowing the caller to assert that an input is valid is unsafe: of course a malicious caller will claim the input is valid. A classic example of this mistake is when a web server trusts client-side validation in JavaScript.

The Problem

However, input validation functions are attack surface. Validating inputs is dangerous, and difficult to do correctly. This is especially true when the input language is complex or when the validator is implemented in an unsafe language. Validators can suffer from a variety of bug classes:

Memory unsafety.

In a memory-unsafe language, it’s too easy to write a validation function that suffers from memory corruption bugs, such as out-of-bounds reads and writes, and use-after-free. This is especially likely if the input language is complex, but even simple input languages can be hard to process correctly in primitive languages.

Type unsafety.

In a type-unsafe language, it can be easy to write a validation function that confuses either the types of input objects or of intermediate types particular to the function’s implementation. Attackers can often exploit such type confusion (often, but not only, by turning it into memory corruption).

Semantic unsafety and application logic bugs.

Some input languages are not necessarily a good match for their problem domain. They may be overpowered, underpowered, or have a poor mapping to concepts in the application domain.

Other input languages have semantics that don’t quite match those of the validating function’s environment, or are underspecified. For example, the JSON ‘grammar’ allows for arbitrarily-long sequences of digits in numeric value literals, but the JavaScript language uses fixed-size, 64-bit IEEE floating point values to represent numbers. JavaScript can correctly represent only 53-bit integers.

The Paradox Of Validation

If the input language is sufficiently complex — which is all too likely in the real world — it might be impossible to write a safe validator. In this case, we have a paradox: the RP can only trust itself to validate its inputs, but cannot safely do so. For example, consider a web server that needs to validate XML inputs, or applications that handle ASN.1.

What Could Go Wrong?

There are (at least) 2 different kinds of complexity that create (at least) 2 different kinds of potential problems during input validation.

Grammatical complexity, and subsequent semantic complexity.

Part of the language-theoretic security goal is to have inputs come from languages low on the Chomsky hierarchy. This reduces the complexity of validators for such inputs, which increases the likelihood that they will be correct.

Even if the validator is not compromised or exploited during validation, there is still the problem of correct semantic interpretation. Complex languages are harder to interpret correctly, giving rise to the problem that the RP might misinterpret a correct and correctly-validated input.

Side-effect complexity.

There is also a question as to the complexity of side-effects that an input language is designed to cause on the operating environment when interpreted. Side-effect complexity is orthogonal to grammatical complexity; even a trivial input language can be used to cause unsafe or undesired side-effects. For example, consider a shell command validator (perhaps as part of a system status monitoring web application):

def is_valid_command(command):
  return re.match(r"^\w+$", command)

def run_command(command):
  if not is_valid_command(command):
    raise ValueError("Nice try, goofball")
  os.system(command)

Although the validator is trivially correct (assuming a design calling for single-word commands with no parameters or shell syntactic metacharacters), and is invoked correctly, the above code is unsafe. The triviality of the grammar does lend itself to further validation though, which is another reason to prefer simpler grammars.

A tighter validator can help us sleep a little easier:

AllowedCommands = set(("uptime", "dmesg", "netstat"))

def is_valid_command(command):
  return command in AllowedCommands

(Presumably, the designers of the application have accepted the risks of exposing the outputs of those commands to callers, presumably authenticated and authorized somehow.)

What About Sandboxing?

What do we gain if the RP delegates validation to a privilege-reduced validator (PRV; e.g. a sandboxed parser), and then trusts the PRV’s result? If the PRV was compromised during validation, its result is no more trustworthy than without sandboxing. But at least the RP would not itself have been compromised.

Sandboxing can reduce the range of potential side-effects that the validator can perform, ensuring that only intended side-effects take place (if any), even if the validator is compromised.

Sandboxing does nothing for grammatical or semantic complexity, however, and does not reduce the risks of later misinterpretation.

What About Pre-processing?

You might also imagine that we can transform an input into a less complex or dangerous form: either a normalized or minimized form of the original input language, or a different, simpler language. I’ll call this downward transformation, but it might have a more official name (please email me if you know it!).

Downward transformation can be lossless or lossy. (It might seem that downward transformation would be inherently lossy, but recall that input languages are sometimes overly powerful. It may be possible to downwardly transform such a language without semantic loss.)

Lossy transformation can carry its own application-semantic risks. For example, consider transforming XML into JSON: XML has the concept of a document type definition (DTD), enabling XML parsers to enforce structural invariants of an XML input. After a pre-processor has transformed the document to JSON, the recipient of the JSON input would have to re-encode the assertions encoded in the DTD, with attendant risks of semantic skew and divergence over time.

Although potentially useful, downward transformation is not a reduction in overall system complexity. In fact, it’s an increase: the system must now be able to handle 2 languages, not just 1. But it can be possible to separate the processing of the 2 languages, including sandboxing 1 or both.

Approaching Solutions

Implement validators in memory-safe languages.

At a minimum, we can and must eliminate the most rinky-dink vulnerability class: memory corruption.

Implement validators in type-safe languages.

When it’s trustworthy, we can use the language’s type system to enforce grammatical and semantic correctness, for example by encoding each syntactic structure as a distinct type, with assertions about child structures and their types. Proofs for free!

Accept only well-specified languages.

Unfortunately, this is easier said than done, both because it’s hard to specify a language, and because a few people are still using JSON. Still, we can sometimes tighten the effective definitions of poor languages that we are forced to accept, by defining subsets and dialects and by writing them into the API definition. (E.g., for a JSON validator: “This function will throw RangeException for integers not in the range – (253 – 1) .. 253 – 1, inclusive.”)

Sandbox to constrain side-effects.

This is ‘obvious’, but it’s not done nearly as often as it should be. Minijail is a good sandboxing option for Linux. There is also NsJail.

Cross-implementation unit and integration tests.

My favorite example is Nicolas Seriot’s JSON torture test/freak show.

Benefits Of Safety And Simplicity

A memory-safe implementation of a validator for a language with no side-effects on the environment (e.g. no shell command intepreter) would allow us to, potentially, skip sandboxing. This is valuable because, outside of a type-safe virtual machine, the fundamental sandboxing primitive is the OS’ process boundary. It can be surprisingly expensive, in both time and space, to create new processes. The problem gets worse at a high degree of granularity (e.g. a fresh sandbox per input, or even potentially 2 for downward transformation) and especially with requirements for high throughput (many inputs) and/or low latency.

From a systems perspective, it’s reasonable to question if unsafe-but-efficient languages are in fact efficient.