Death Before Confusion (was: [whatwg] Handling out of memory issues with getImageData/createImageData)

# Mark S. Miller (9 years ago)

[-whatwg, +es-discuss] Reposting to es-discuss, as Anne's more general question is best seen as a JS issue rather than a browser-specific one.

On Sun, Sep 27, 2015 at 8:30 AM, Mark S. Miller <erights at google.com> wrote:

On Sat, Sep 26, 2015 at 7:34 AM, Anne van Kesteren <annevk at annevk.nl> wrote:

On Fri, Sep 25, 2015 at 4:48 PM, Justin Novosad <junov at google.com> wrote:

Currently there is no spec'ed behavior for handling out-of memory issues for the specific case of attempting to allocate a large buffer through image data APIs.

Actually, there is no specified behavior for out-of-memory behavior, period. This is a problem that starts with the ECMAScript standard and everything that builds upon it.

I have seen Mark Miller discuss some of the issues surrounding this and perhaps even the necessity to eventually define it, but so far this has not happened. Not sure if the full story is documented somewhere. Mark?

esdiscuss.org/topic/using-max-stack-limit-to-determine-current-js-engine-and-revision#content-7 indicates there may be security issues with throwing out-of-memory exceptions.

Well, the full story is never documented ;). However, that post and the links from there:

www.eros-os.org/pipermail/e-lang/2007-January/011817.html, google/caja#460

are a good start. The security issue is serious and needs to be fixed. It cannot practically be fixed by libraries without additional help by the platform. The problem is that

  • In a language that implicitly allocates everywhere, like JavaScript, Java, and many other oo languages, it is impossible to prevent code from causing OOM.
  • If OOM is thrown (see the first link for Java/Joe-E issues), and the language has try/finally, it is impossible to prevent the OOM being masked.
  • In such languages, it is impossible to program defensively against the pervasive possibility of OOM -- if execution simply resumes in that context as if nothing bad happened.
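The masking problem in the second bullet can be sketched in JavaScript. This is only an illustration: `SimulatedOOM` and `allocateHuge` are hypothetical stand-ins, since a real out-of-memory condition cannot be triggered portably.

```javascript
// Hypothetical stand-in for a real out-of-memory condition.
class SimulatedOOM extends Error {}

// Hypothetical allocation that always fails, to keep the sketch deterministic.
function allocateHuge() {
  throw new SimulatedOOM("simulated out of memory");
}

function maskedByFinally() {
  try {
    allocateHuge();
    return "allocated";
  } finally {
    // A return inside finally replaces the pending exception,
    // so the caller never learns that the allocation failed.
    return "looks fine";
  }
}

const result = maskedByFinally();
console.log(result); // the SimulatedOOM never reaches this point
```

Any code on the stack between the failing allocation and the intended handler can do this, whether by accident or on purpose, which is why the error cannot be relied on to propagate.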

In Joe-E we took the painful step of outlawing the Java try/finally from the Joe-E subset of Java for this reason. There was no other reason to outlaw try/finally as there's nothing else inherently unsafe about it. We really tried to find another solution but under our constraints -- no rewriting of the Java nor change to the JVM -- we could not.

By preventing Joe-E code from catching VirtualMachineErrors and from doing a try/finally, the Joe-E code was preemptively terminated immediately on occurrence of a VirtualMachineError. Only the spawner of the Joe-E computation could react to this termination of the computation it spawned.

This mirrors one of the many things that Erlang gets right. When a program is confused, that program is the last one you want to ask to recover from the confusion, since it is already impaired by its own confusion. If you don't know what is still true, you are unlikely to engage in repair actions correctly. Better to preemptively terminate some large unit containing the confusion and recover by

  • restarting from earlier known good state, or
  • if this is not yet feasible, propagating the termination to a yet larger granularity of computation.

This is the "fail stop" philosophy of "Death Before Confusion". The contrasting philosophy appropriate for some computation is "best efforts". Some JavaScript code is best served by one and some by the other. Security enforcing code must maintain its own integrity at the price of termination (and restart from some coarser grain). Web pages using JavaScript only to spice up the user experience are often best served by best efforts. Erlang itself is an interesting case study, as its original motivating problem -- telephone switches -- places a higher priority on uptime than on integrity. Nevertheless, both Erlang and the Tandem non-stop architecture found that uptime in the large is best served by fail-stop in the small combined with coarser-grain recovery logic.

Because JavaScript comes from such a long legacy of de facto best efforts architecture, I think a direct de jure shift to fail-stop is unlikely. Instead, what we need is a trap-handling mechanism (Erlang "supervisor", KeyKOS "keeper"), where different policies can be expressed by user-defined trap handlers. When multiple policies co-exist, the platform obeys the more severe policies. For concreteness, I'll make here a first sketch:

On OOM, the platform first scans the stack to enumerate all realms represented by in-progress stack frames as of that moment. (In progress meaning that the stack frame still would have been there even if that platform had implemented proper-tail-call.) It gathers the trap handlers associated with each of those realms. Each trap handler is a pair of a string and an optional function.

The string indicates the choice of trap handling strategy, where these strategies are ordered by severity. Among the gathered strategies, the most severe wins and the rest are discarded. From least to most severe, they are

  1. "THROW"
  2. "ABORT_JOB"
  3. "REFRESH"
  4. "ABORT_EVENT_LOOP"

Except for "THROW", all the rest cause the current turn/job to first be preemptively terminated without running catch or finally blocks. If during any one trap handling strategy we run out of reserve memory, then we automatically escalate to the next more severe strategy. Alternatively, if a trap handling function is itself associated with yet another otherwise-uninvolved realm with its own trap handler, then an OOM inside this trap handler might be handled by that handler's handler.
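The selection and escalation rules in this sketch could look roughly like the following. The helper names `selectHandlers` and `escalate`, the `{ strategy, fn }` record shape, and the default used when no realm registers a handler are all assumptions made for illustration; the proposal does not specify them.

```javascript
// Strategies from the sketch, ordered from least to most severe.
const STRATEGIES = ["THROW", "ABORT_JOB", "REFRESH", "ABORT_EVENT_LOOP"];

// Hypothetical helper: given the trap handlers gathered from the realms on
// the stack, keep only those that registered the most severe strategy.
function selectHandlers(handlers) {
  let worst = -1;
  for (const h of handlers) {
    const rank = STRATEGIES.indexOf(h.strategy);
    if (rank < 0) throw new TypeError(`unknown strategy: ${h.strategy}`);
    worst = Math.max(worst, rank);
  }
  // Assumption: with no handlers at all, fall back to the least severe strategy.
  if (worst < 0) return { strategy: "THROW", handlers: [] };
  const strategy = STRATEGIES[worst];
  return { strategy, handlers: handlers.filter((h) => h.strategy === strategy) };
}

// Hypothetical helper: the next more severe strategy, used when a strategy
// itself runs out of reserve memory. The most severe strategy stays put.
function escalate(strategy) {
  const rank = STRATEGIES.indexOf(strategy);
  if (rank < 0) throw new TypeError(`unknown strategy: ${strategy}`);
  return STRATEGIES[Math.min(rank + 1, STRATEGIES.length - 1)];
}

const chosen = selectHandlers([
  { strategy: "THROW" },
  { strategy: "REFRESH", fn: () => {} },
  { strategy: "ABORT_JOB" },
]);
console.log(chosen.strategy, escalate(chosen.strategy));
```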

This is just a first sketch. It is probably too complicated in some ways and insufficiently general in others. I post it mostly to get the discussion started.

# Filip Pizlo (9 years ago)

It seems that most of the benefit for fail-faster behavior for VM errors is security.

To what extent do you think the security problem could be addressed by VMs simply randomizing the point at which stack overflow or OOM happens? I think this would be more desirable, since it requires no language changes.

More comments inline...

On Sep 27, 2015, at 8:46 AM, Mark S. Miller <erights at google.com> wrote:

The string indicates the choice of trap handling strategy, where these strategies are ordered by severity. Among the gathered strategies, the most severe wins and the rest are discarded. From least to most severe, they are

"THROW" "ABORT_JOB" "REFRESH" "ABORT_EVENT_LOOP"

This seems pretty sensible, but I'd like it more if it was simpler.

Wouldn't this be practically as useful if we just had THROW and ABORT_EVENT_LOOP? I can see how to use those modes, but I don't know how to use the others.

# Mark S. Miller (9 years ago)

On Sun, Sep 27, 2015 at 9:57 AM, Filip Pizlo <fpizlo at apple.com> wrote:

It seems that most of the benefit for fail-faster behavior for VM errors is security.

To what extent do you think the security problem could be addressed by VMs simply randomizing the point at which stack overflow or OOM happens? I think this would be more desirable, since it requires no language changes.

It would help, but not enough. The defender would have some window of memory budget within which it randomizes. The attacker could repeatedly probe in order to get a statistical sense of the range and shape of that window. Then the attacker could repeatedly allocate until it got, say, to the 80th percentile of that window without failing, and then call the defender. If the defender then does a delicate operation that happens to allocate more than the remaining 20 percent of that window, then the attacker has broken the defender's integrity in a way the attacker may be able to exploit.
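The arithmetic behind this argument can be sketched under one strong assumption: that the engine draws its OOM limit uniformly at random from a window `[lo, hi]`. The window bounds, the units, and both helper functions below are hypothetical and chosen only to make the numbers easy to follow.

```javascript
// Point in the window [lo, hi] at the given percentile (0 to 100).
function percentilePoint(lo, hi, percentile) {
  return lo + ((hi - lo) * percentile) / 100;
}

// Chance that the defender's operation, which allocates `need` units, hits
// the limit, given that the attacker already filled memory up to `filled`
// units without failing. Assumes the limit is uniform on the rest of the window.
function defenderFailureChance(filled, hi, need) {
  if (filled >= hi) throw new RangeError("attacker cannot fill past the window");
  return Math.min(1, need / (hi - filled));
}

const lo = 1000; // hypothetical window, arbitrary units
const hi = 2000;
const filled = percentilePoint(lo, hi, 80); // attacker stops at the 80th percentile

const small = defenderFailureChance(filled, hi, 100); // needs half of what is left
const large = defenderFailureChance(filled, hi, 250); // needs more than what is left
console.log(filled, small, large);
```

Under these assumptions, a defender operation that needs more than the remaining part of the window fails with certainty, and smaller operations still fail with a probability the attacker can estimate from earlier probes.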

Nevertheless, it does help a lot. It would be a nice speed bump until a real defense can be designed, agreed on, and put in place.

We also need real experiments to determine how hard these attacks actually are to mount. And once successful, how hard they are to commodify, so that other less skilled attackers can reuse these attacks. As always, if you do this as a white hat, please engage in responsible disclosure for a reasonable finite period before making a successful attack public. Thanks.

This seems pretty sensible, but I'd like it more if it was simpler.

Me too!

Wouldn't this be practically as useful if we just had THROW and ABORT_EVENT_LOOP? I can see how to use those modes, but I don't know how to use the others.

Perhaps. Let's start with your simpler hypothesis. I like it.

# Ron Waldon (9 years ago)

Android has an older onLowMemory() callback and a newer onTrimMemory() callback. iOS has something similar as well.

Is making these available in ECMAScript proper or an annex a potential solution to this class of problem?

# Filip Pizlo (9 years ago)

I don't think that prevents a caller from adversarially injecting - and then catching - faults into a callee in such a way that the caller can control which part of the callee runs and which part doesn't. The ability to catch the fault is what causes the security issues, since the caller can keep running even when the callee was forced to give up.
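A minimal sketch of this scenario follows, using a simulated heap so that it runs deterministically. The `heap` budget, `SimulatedOOM`, and the `ledger` object are hypothetical; in a real engine the fault would come from implicit allocations that the callee cannot see or avoid.

```javascript
// Hypothetical stand-in for a real out-of-memory condition.
class SimulatedOOM extends Error {}

// Hypothetical stand-in for the engine's heap: each alloc() consumes budget.
const heap = {
  remaining: 0,
  alloc(units) {
    if (units > this.remaining) throw new SimulatedOOM("simulated out of memory");
    this.remaining -= units;
  },
};

// Callee (defender): intends to preserve the invariant a + b === 100.
const ledger = {
  a: 100,
  b: 0,
  transfer(amount) {
    heap.alloc(1); // stands in for an implicit allocation before the debit
    this.a -= amount;
    heap.alloc(1); // stands in for an implicit allocation before the credit
    this.b += amount;
  },
};

// Caller (attacker): leaves just enough budget for the first step,
// then catches the fault and keeps running.
heap.remaining = 1;
let caught = false;
try {
  ledger.transfer(30);
} catch (e) {
  caught = e instanceof SimulatedOOM;
}

// The debit happened but the credit did not, so the invariant is broken.
console.log(caught, ledger.a, ledger.b);
```

The callee has no chance to repair its state, because the fault can land between any two of its steps, while the caller continues with a reference to the now-inconsistent object.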

Being able to detect and act upon low memory conditions is helpful in other ways, but I don't think it prevents the bad scenario from happening.

# Geoffrey Garen (9 years ago)

I don’t object to the idea of levels of severity when throwing an exception, but I don’t think it will be sufficient to defend against attacks either.

An attacker that wants to infer information about the target VM or stop execution in some target code at a point of inconsistent state will still be able to do so even if the attacker can’t immediately catch the exception. The attacker can pre-arrange a user event handler, timer, animation callback, worker, network response, onerror callback, etc. in order to resume execution after the attack has completed.

To some extent, this problem may be unsolvable.

To the extent that this problem can be solved, I think the VM would need to halt all further execution within the webpage/world/realm after an out-of-stack or out-of-heap exception.

Halting execution is draconian, but its downsides are limited by the fact that the webpage was probably not going to function correctly after the exception anyway. The main downside I see to halting all execution is that client-side error analytics will not be able to report on out-of-heap and out-of-stack errors.

Geoff

# Isiah Meadows (9 years ago)

I see potential security benefits on the server side, though (e.g. Node). If someone manages to DDoS a server through a RAM-heavy route, that can become a problem where it's safe to take extra precautions to avoid OOM, but the attacker can't add their own hooks without being able to execute arbitrary code (which would require other flaws). It's also possible to restart the server automatically without the process dying in that specific case, and the server won't be in an invalid state inside the engine.

# Mark S. Miller (9 years ago)

On Mon, Sep 28, 2015 at 1:20 PM, Geoffrey Garen <ggaren at apple.com> wrote:

I don’t object to the idea of levels of severity when throwing an exception, but I don’t think it will be sufficient to defend against attacks either.

Agreed that if you're throwing, then you are vulnerable to attacks. The levels-of-severity idea is that THROW is the least severe and all more severe levels do not throw. Let's examine in terms of Filip's suggestion that we only have THROW and ABORT_EVENT_LOOP. If anyone participating in the event loop wants ABORT_EVENT_LOOP then that's what we have. Only ABORT_EVENT_LOOP claims to provide any defense.

An attacker that wants to infer information about the target VM or stop execution in some target code at a point of inconsistent state will still be able to do so even if the attacker can’t immediately catch the exception. The attacker can pre-arrange a user event handler, timer, animation callback, worker, network response, onerror callback, etc. in order to resume execution after the attack has completed.

To some extent, this problem may be unsolvable.

To the extent that this problem can be solved, I think the VM would need to halt all further execution within the webpage/world/realm after an out-of-stack or out-of-heap exception.

Yes, that's what ABORT_EVENT_LOOP must do.

Halting execution is draconian, but its downsides are limited by the fact that the webpage was probably not going to function correctly after the exception anyway. The main downside I see to halting all execution is that client-side error analytics will not be able to report on out-of-heap and out-of-stack errors.

That's a very good point. In systems with this kind of architecture, collaborators outside the terminated unit receive some signal indicating what allegedly went wrong inside the terminated unit. E leverages its promise-based communications model for this purpose. The action of aborting a vat carries a simple data string, not a general object. The vat itself is preemptively aborted without any further computation within the condemned vat. All remote promises into the terminated vat transition into (in JS terminology) rejected promises, where the rejection reason is that string -- the alleged reason for termination.
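In JavaScript terms, the signalling described here might be sketched as follows. The `Vat` class and its methods are hypothetical and do not correspond to E's actual API or to any existing JavaScript API; the sketch only shows outstanding promises being rejected with a plain string when the unit is aborted.

```javascript
// Hypothetical Vat: a unit of computation that can be preemptively aborted.
class Vat {
  #pending = new Set(); // reject functions of outstanding remote promises
  #reason = null;       // termination reason once aborted (a plain string)

  // Returns a promise standing in for a remote promise into this vat.
  remotePromise() {
    if (this.#reason !== null) return Promise.reject(this.#reason);
    return new Promise((_resolve, reject) => this.#pending.add(reject));
  }

  // Aborting carries only a data string, never a general object, and runs
  // no further code inside the vat.
  abort(reason) {
    if (typeof reason !== "string") throw new TypeError("reason must be a string");
    if (this.#reason !== null) return; // already aborted
    this.#reason = reason;
    for (const reject of this.#pending) reject(reason);
    this.#pending.clear();
  }

  get pendingCount() { return this.#pending.size; }
  get reason() { return this.#reason; }
}

const vat = new Vat();
const seen = [];
// Collaborators outside the vat observe the alleged reason for termination.
vat.remotePromise().catch((reason) => seen.push(reason));
vat.remotePromise().catch((reason) => seen.push(reason));
const pendingBefore = vat.pendingCount;

vat.abort("out of memory");
setTimeout(() => console.log(seen), 0); // rejection handlers run asynchronously
```

Restricting the reason to a string keeps objects from the condemned unit from leaking to its collaborators, which matches the intent that no computation from inside the terminated unit survives.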

This does provide a so-called "termination side channel" to these collaborators. But our primary concern is integrity. On confidentiality, neither E nor JS is in a good position to make strong claims about outward signaling[*] on non-overt (side or covert) channels, and we should not try. This side channel is low bandwidth, and only inter-vat collaborators can read it. Those objects at the boundaries between vats already have access to other non-pluggable non-overt channels.

[*] Two caveats:

  1. As Waldemar points out, we must be careful about high bandwidth non-overt channels.
  2. E does make strong claims about limiting the listening to non-overt channels, by use of "fail-stop loggable non-determinism". I am becoming hopeful that JS may as well. But this is a topic for another day.