Software Reliability

June 13, 2006

Based on a manifesto by Kirk Glerum, Office started thinking about reliability around 2000. As a result, a new, improved Dr. Watson identified application crashes and hangs and relayed errors back to Microsoft, so that developers could fix the most common ones. By one statistic, 50% of crashes were caused by 1% of the bugs. Office also added document recovery, in which documents are periodically auto-saved. Documents may actually be saved in the midst of a crash, which risks corruption. More recently, the Open XML file formats were designed to ease recoverability and limit corruption. Windows Vista adds further support, such as shadow copies.

This approach to addressing reliability is rather crude. A more thorough approach is to address the issue at a more fundamental level, for example by using a different programming style. Such thinking has led me to a more functional style of programming, characterized by one-time assignment and persistent data structures. With mutable data structures, any single misplaced or incorrect assignment can introduce inconsistency into the system. Functions can, of course, return incorrect values too, but the lack of side effects keeps errors isolated from the rest of the system, and proper sequencing is not an issue for function calls as it is for assignments. When an exception occurs, I do not have to undo any state changes. I can simply forget about the work that led to the exception and revert to the last known good copy of my data. My application no longer needs to abort, since I have, in effect, traveled instantly back in time.
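A minimal Python sketch of "forget the work and keep the last good copy." The `Document` type and the `apply_edit` and `safe_edit` functions are hypothetical names made up for illustration; they do not come from Office or any real library. State is an immutable value, each edit builds a new value, and a failed edit cannot damage the old one.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Document:
    """Hypothetical immutable document state (illustrative only)."""
    title: str
    paragraphs: tuple[str, ...] = ()


def apply_edit(doc: Document, text: str) -> Document:
    """Return a new Document; the one passed in is never modified."""
    if not text.strip():
        raise ValueError("empty paragraph")
    return replace(doc, paragraphs=doc.paragraphs + (text,))


def safe_edit(doc: Document, text: str) -> Document:
    """Try an edit; on failure, forget the work and keep the old value."""
    try:
        return apply_edit(doc, text)
    except ValueError:
        return doc  # last known good copy, untouched by the failed edit


good = Document(title="Reliability")
good = safe_edit(good, "First paragraph.")
good = safe_edit(good, "   ")  # this edit fails; the state is unchanged
```

There is no undo logic anywhere: the exception handler only has to return the value it already had.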

Erlang is one of those hot new languages right now. It’s declarative, functional, and concurrent. Damien Katz says that reliability is why he chose Erlang, citing an article about Erlang on Byte.com.

Meanwhile, back at Ericsson, some Erlang-based products that were already in progress when the "ban" went into effect came to market, including the AXD 301, an ATM switch with 99.9999999 percent reliability (9 nines, or 31 ms. downtime a year!), which has captured 11% of the world market. The AXD 301 system includes 1.7 million lines of Erlang: This isn't just some academic language.
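The "9 nines" figure in the quoted passage can be checked with a line of arithmetic. A small Python sketch, assuming a 365-day year; the function name `downtime_ms_per_year` is invented here for illustration:

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000, ignoring leap years


def downtime_ms_per_year(nines: int) -> float:
    """Expected yearly downtime in milliseconds at N nines of availability."""
    unavailable_fraction = 10.0 ** -nines
    return SECONDS_PER_YEAR * unavailable_fraction * 1000.0


nine_nines = downtime_ms_per_year(9)  # roughly 31.5 ms
five_nines = downtime_ms_per_year(5)  # roughly 315,000 ms, about 5.3 minutes
```

So the quoted "31 ms" is consistent with 99.9999999 percent availability over one year.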

Reliability is a major design philosophy, as revealed in this thesis on Erlang, “Making reliable distributed systems in the presence of software errors.” (The PowerPoint version.) The programming model espoused by Erlang makes extensive use of “fail-fast processes,” an approach also known as “Let it crash!” When a function encounters an error that it cannot handle, it should simply fail. The caller, which still holds the original copy of the state, can then decide whether to restart the function, attempt a simpler routine, or itself fail.

The process approach to fault isolation advocates that the process software be fail-fast, it should either function correctly or it should detect the fault, signal failure and stop operating. Processes are made fail-fast by defensive programming. They check all their inputs, intermediate results and data structures as a matter of course. If any error is detected, they signal a failure and stop.
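The fail-fast idea, with the caller choosing among restart, a simpler routine, or failing itself, can be sketched in Python. This is not Erlang's supervisor mechanism, only an analogy using ordinary exceptions; `supervise`, `mean_reading`, and `first_valid_reading` are hypothetical names.

```python
import math
from typing import Callable, TypeVar

S = TypeVar("S")
R = TypeVar("R")


def mean_reading(readings: tuple[float, ...]) -> float:
    """Fail-fast worker: check the inputs and stop at the first fault."""
    if not readings:
        raise ValueError("no readings")
    if any(math.isnan(r) for r in readings):
        raise ValueError("corrupt reading")
    return sum(readings) / len(readings)


def first_valid_reading(readings: tuple[float, ...]) -> float:
    """Simpler routine: return the first usable reading, or fail."""
    for r in readings:
        if not math.isnan(r):
            return r
    raise ValueError("no valid reading")


def supervise(task: Callable[[S], R],
              fallback: Callable[[S], R],
              state: S,
              max_restarts: int = 1) -> R:
    """Run task on the caller's untouched state; restart, then fall back.

    If the fallback raises as well, the exception propagates, so this
    caller fails too and its own caller gets to decide what to do.
    """
    for _ in range(1 + max_restarts):
        try:
            return task(state)
        except Exception:
            continue  # the failed attempt never modified `state`
    return fallback(state)


clean = supervise(mean_reading, first_valid_reading, (20.0, 22.0))
degraded = supervise(mean_reading, first_valid_reading, (20.0, math.nan, 22.0))
```

Restarting only helps with transient faults; for a deterministic error like the corrupt reading here, the restart fails again and the simpler routine takes over.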

Damien Katz expounded more fully on the Erlang approach to handling exceptions in his popular post, “Error codes or Exceptions? Why is Reliable Software so Hard?” He writes about the dangers of “Reverse the Flow of Time” error handling, which is made necessary by the use of mutable data, or “destructive update.” The problem is even more insidious if an exception occurs during the exception handling itself.

The better approach is not to “undo your actions, [but] just forget them,” an approach not natively supported by any of the popular languages. Once destructive updates are eliminated, actions do not need to be undone, just forgotten. Damien first points to two ways of achieving this: (1) make a copy of everything up front (low-tech but expensive) or (2) make objects immutable. Alternatively, one can confine object mutation to a single isolated operation, an atomic update; undoing is then limited to that one operation.
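A minimal Python sketch of combining the copy-up-front approach with a single atomic update. The `Inventory` class and its methods are hypothetical, chosen only to illustrate the pattern: all work happens on a private draft, and the object's only mutation is one assignment at the end.

```python
class Inventory:
    """Hypothetical object whose only mutation is a single assignment."""

    def __init__(self, stock: dict[str, int]) -> None:
        self._stock = dict(stock)

    def snapshot(self) -> dict[str, int]:
        """Return a copy so that callers cannot mutate internal state."""
        return dict(self._stock)

    def ship(self, orders: dict[str, int]) -> None:
        draft = dict(self._stock)  # copy up front; all work is on the draft
        for item, qty in orders.items():
            if draft.get(item, 0) < qty:
                # Nothing to undo: the half-updated draft is just forgotten.
                raise ValueError(f"insufficient stock for {item}")
            draft[item] -= qty
        self._stock = draft  # the atomic update: one commit point


inventory = Inventory({"widget": 5, "gadget": 1})
try:
    inventory.ship({"widget": 2, "gadget": 3})  # fails on the second item
except ValueError:
    pass
after_failure = inventory.snapshot()  # still the original stock
inventory.ship({"widget": 2})
after_success = inventory.snapshot()
```

The failed shipment had already decremented `widget` in the draft, yet the visible state never changed, because the draft was never committed.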

The ideal approach is to use a functional programming language, for which Damien recommends Erlang:

Erlang, which started me thinking about these issues, is a functional programming language that gets reliability right in a simple and elegant way that I think is fairly easy for an experienced OO programmer to pick up (you don't even have to learn about monads). Erlang is different in that it's far more dynamic and "scripty" than other functional languages (it even has an interactive command line mode), making the development process more incremental and approachable.

Erlang is marvelously beautiful in the way it meshes the concepts of immutability, messaging, pattern matching, processes and process hierarchy to create a language and runtime where extreme concurrency and reliability means adhering to a few simple design principles. Someday I'm going to explain the whole Erlang development philosophy and why it's so damn awesome.

Finally, Damien ends with the same conclusion I reached about software reliability.

The bigger problem in software reliability isn't how we communicate errors, it's the state we are in when the error happens. Any attempts to "Reverse the Flow of Time" in code are bad. Avoid it. Instead convert your code to avoid mutations and use "Get the Hell out Of Dodge" error handling instead. You'll thank me later.

If we look at how Office does document recovery, it essentially copies the data by auto-saving at regular, short intervals. In the threading space, a newly proposed replacement for locks, called “software transactional memory,” works by copying the original values of the data structure to be changed; despite the overhead of copying, this lock-free approach can actually outperform locks and scale better on multiprocessors.
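The copy-then-commit idea behind transactional memory can be illustrated with a rough Python sketch. This is not a real STM implementation, only optimistic concurrency on a single cell; `VersionedRef` is a hypothetical class. Python exposes no compare-and-swap primitive, so a small lock guards the read and commit steps, while the user's computation runs outside it. The sketch assumes the stored values are immutable, so a snapshot needs no deep copy.

```python
import threading
from typing import Callable, Generic, TypeVar

T = TypeVar("T")


class VersionedRef(Generic[T]):
    """Hypothetical shared cell updated optimistically; not a real STM."""

    def __init__(self, value: T) -> None:
        self._value = value
        self._version = 0
        # Stands in for a hardware compare-and-swap; held only briefly.
        self._commit_lock = threading.Lock()

    def read(self) -> tuple[int, T]:
        with self._commit_lock:
            return self._version, self._value

    def update(self, fn: Callable[[T], T]) -> T:
        """Apply fn to a snapshot; retry if another thread committed first."""
        while True:
            version, snapshot = self.read()
            candidate = fn(snapshot)  # computed outside the lock
            with self._commit_lock:
                if self._version == version:  # nobody else committed
                    self._value = candidate
                    self._version += 1
                    return candidate
            # Conflict: forget `candidate` and retry on a fresh snapshot.


counter = VersionedRef(0)


def worker() -> None:
    for _ in range(1000):
        counter.update(lambda n: n + 1)


threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
final_version, final_value = counter.read()
```

A conflicting update is never rolled back; the stale result is simply discarded and recomputed, which is the same "forget, don't undo" principle applied to concurrency.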

© 2015 - Wesner P. Moise, LLC. All rights reserved.
