Recovery Domains: An Organizing Principle for Recoverable Operating Systems

Andrew Lenharth, Samuel T. King, Vikram Adve

Abstract:

We describe a strategy for enabling existing commodity operating systems to recover from unexpected run-time errors in nearly any part of the kernel, including core kernel components. Our approach is dynamic and request-oriented: it isolates the effects of a fault to the requests that cause it rather than to static kernel components. The approach is based on the notion of "recovery domains," an organizing principle that enables partial rollback of affected state within a request in a multithreaded system. We have applied this approach to the Linux kernel, which required fewer than 126 lines of changed or new code; all other changes are performed by a simple compiler instrumentation pass. Our experiments show that the approach is able to recover from otherwise fatal faults with minimal collateral impact during a recovery event.

To Appear:

"Recovery Domains: An Organizing Principle for Recoverable Operating Systems"
Andrew Lenharth, Samuel T. King, and Vikram Adve.
Proceedings of the Fourteenth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '09), Washington, DC, March 2009.

Download:

Paper:

Recovery Domains: An Organizing Principle for Recoverable Operating Systems (PDF)