Based on a manifesto by Kirk Glerum, Office started thinking about reliability around 2000. As a result, a new, improved Dr. Watson identified application crashes and hangs and relayed error reports back to Microsoft, so developers could fix the most common errors. One statistic has it that 50% of crashes were caused by 1% of the bugs. Office also added document recovery, in which documents are periodically auto-saved. Documents may actually be saved in the midst of a crash, risking potential corruption. More recently, the Open XML file formats were designed to ease recoverability and limit corruption. Windows Vista now adds further support with features such as shadow copies.
This approach to addressing reliability is rather crude. A more thorough approach is to address the issue at a more fundamental level, for example by adopting a different programming style. Such thinking has led me to a more functional style of programming, characterized by one-time assignments and persistent data structures. With mutable data structures, any single misplaced or incorrect assignment could introduce inconsistency into the system. While functions can also return incorrect values, the lack of side effects keeps errors isolated from the rest of the system; proper sequencing is not an issue for function calls as it is for assignments. When an exception occurs, I do not have to undo any state changes. I can simply forget about the work that led to the exception and revert to the last known good copy of my data. My application no longer needs to abort, since I have, in effect, traveled instantly back in time.
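To make the "forget the work and revert" idea concrete, here is a minimal sketch in Python. The `Document` type and the `append_paragraph` and `apply_edits` helpers are hypothetical names invented for this illustration; the only assumption is that state lives in immutable values, so a failed operation cannot leave anything half-changed.

```python
from dataclasses import dataclass, replace
from typing import Iterable, Tuple


@dataclass(frozen=True)
class Document:
    """Hypothetical immutable document: fields cannot be reassigned."""
    title: str
    paragraphs: Tuple[str, ...] = ()


def append_paragraph(doc: Document, text: str) -> Document:
    """Return a NEW document; the one passed in is never modified."""
    if not text.strip():
        raise ValueError("paragraph must not be empty")
    return replace(doc, paragraphs=doc.paragraphs + (text,))


def apply_edits(doc: Document, edits: Iterable[str]) -> Document:
    """Apply every edit, or raise and leave the caller's copy untouched."""
    current = doc
    for text in edits:
        current = append_paragraph(current, text)
    return current


last_good = Document("Notes", ("first",))

try:
    result = apply_edits(last_good, ["second", "   ", "third"])
except ValueError:
    # Nothing to undo: the partially edited copies are simply dropped.
    result = last_good

print(result.paragraphs)  # ('first',)
```

The second edit fails after the first has already been applied, yet no cleanup code is needed, because the intermediate documents were never visible to anyone else.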
Erlang is one of those hot new languages right now. It’s declarative, functional, and concurrent. Damien Katz says that reliability is why he chose Erlang, citing an article on Erlang in Byte.com.
Meanwhile, back at Ericsson, some Erlang-based products that were already in progress when the "ban" went into effect came to market, including the AXD 301, an ATM switch with 99.9999999 percent reliability (9 nines, or 31 ms. downtime a year!), which has captured 11% of the world market. The AXD 301 system includes 1.7 million lines of Erlang: This isn't just some academic language.
Reliability is a major design philosophy, as revealed in this thesis on Erlang, “making reliable distributed systems in the presence of software errors.” (The PowerPoint version.) The programming model espoused by Erlang makes extensive use of “fail-fast processes,” also known as “Let it crash!” When a function encounters an error it cannot handle, it should simply fail. The caller, which still holds the original copy of the state, can then decide whether to restart the function, attempt a simpler routine, or itself fail.
The process approach to fault isolation advocates that the process software be fail-fast, it should either function correctly or it should detect the fault, signal failure and stop operating. Processes are made fail-fast by defensive programming. They check all their inputs, intermediate results and data structures as a matter of course. If any error is detected, they signal a failure and stop.
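The caller's three choices (restart, try something simpler, or fail in turn) can be sketched in a few lines. This is Python rather than Erlang, and it models the idea with plain functions and exceptions instead of real Erlang processes; `Failure`, `parse_total`, `sum_valid_fields` and `supervise` are all hypothetical names made up for the illustration.

```python
from typing import Callable, Tuple

Fields = Tuple[str, ...]


class Failure(Exception):
    """Signal raised by a fail-fast routine that detected a fault."""


def parse_total(fields: Fields) -> int:
    """Fail-fast routine: check every input and stop at the first fault."""
    if not fields:
        raise Failure("no fields")
    total = 0
    for field in fields:
        if not field.isdecimal():
            raise Failure(f"not a number: {field!r}")
        total += int(field)
    return total


def sum_valid_fields(fields: Fields) -> int:
    """Simpler routine: ignore bad fields, fail only if none are usable."""
    valid = [int(field) for field in fields if field.isdecimal()]
    if not valid:
        raise Failure("nothing to sum")
    return sum(valid)


def supervise(task: Callable[[Fields], int],
              simpler: Callable[[Fields], int],
              state: Fields,
              restarts: int = 1) -> int:
    """Run task on the original state; restart, then fall back, then fail."""
    for _ in range(1 + restarts):
        try:
            return task(state)      # every attempt sees the same state
        except Failure:
            continue                # restart with the untouched original
    return simpler(state)           # a Failure here propagates: we fail too


print(supervise(parse_total, sum_valid_fields, ("1", "2")))       # 3
print(supervise(parse_total, sum_valid_fields, ("1", "x", "2")))  # 3
```

Because `state` is an immutable tuple, a failed attempt cannot have damaged it, which is what makes a restart safe. A restart only helps with transient faults, of course; a deterministic fault like the bad field above fails again and ends up in the simpler routine.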
Damien Katz more fully expounded on the Erlang approach to handling exceptions in his popular post on “Error codes or Exceptions? Why is Reliable Software so Hard?” He writes about the dangers of “Reverse the Flow of Time” error handling, made necessary by the use of mutable data or “destructive update.” This problem is more insidious if an exception occurs during the exception handling.
The better approach is not to “undo your actions, [but] just forget them,” an approach not natively supported by any of the popular languages. By eliminating destructive updates, actions do not need to be undone, just forgotten. Damien first points to two ways of achieving this: (1) make a copy of everything up-front (low-tech but expensive) or (2) make objects immutable. Alternatively, one can confine object mutation to a single operation in isolation, an atomic update; here, undoing is limited to that one operation.
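A rough sketch of the copy-up-front and atomic-update options, in Python. The order dictionary, `add_line_items`, `with_upfront_copy` and `OrderBox` are hypothetical names for this illustration and do not come from Damien's post.

```python
import copy
from typing import List, Tuple

Item = Tuple[str, int]


def add_line_items(order: dict, items: List[Item]) -> None:
    """Destructive update: mutates order in place and may fail midway."""
    for name, price in items:
        if price < 0:
            raise ValueError(f"negative price for {name!r}")
        order["lines"].append((name, price))
        order["total"] += price


def with_upfront_copy(order: dict, items: List[Item]) -> dict:
    """Option 1: mutate a deep copy; the original is never touched."""
    working = copy.deepcopy(order)
    add_line_items(working, items)   # if this raises, the copy is dropped
    return working


class OrderBox:
    """Atomic update: the only mutation is rebinding one reference."""

    def __init__(self, order: dict) -> None:
        self.current = order

    def update(self, items: List[Item]) -> None:
        candidate = with_upfront_copy(self.current, items)
        self.current = candidate     # single step; nothing to undo before it


box = OrderBox({"lines": [], "total": 0})
try:
    box.update([("pen", 2), ("refund", -1)])   # fails on the second item
except ValueError:
    pass                                       # forget the failed work

print(box.current)                             # {'lines': [], 'total': 0}
box.update([("pen", 2), ("ink", 3)])
print(box.current["total"])                    # 5
```

The deep copy is the "low-tech but expensive" part; immutable or persistent structures get the same guarantee while sharing the unchanged pieces between versions.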
The ideal approach is to use a functional programming language, for which Damien recommends Erlang:
Erlang, which started me thinking about these issues, is a functional programming language that gets reliability right in a simple and elegant way that I think is fairly easy for an experienced OO programmer to pick up (you don't even have to learn about monads). Erlang is different in that it's far more dynamic and "scripty" than other functional languages (it even has an interactive command line mode), making the development process more incremental and approachable.
Erlang is marvelously beautiful in the way it meshes the concepts of immutability, messaging, pattern matching, processes and process hierarchy to create a language and runtime where extreme concurrency and reliability means adhering to a few simple design principles. Someday I'm going to explain the whole Erlang development philosophy and why it's so damn awesome.
Finally, Damien ends with the same conclusions I reached about software reliability.
The bigger problem in software reliability isn't how we communicate errors, it's the state we are in when the error happens. Any attempts to "Reverse the Flow of Time" in code are bad. Avoid it. Instead convert your code to avoid mutations and use "Get the Hell out Of Dodge" error handling instead. You'll thank me later.
If we look at how Office does document recovery, it is essentially through copying the data by “Auto Saving” at regular, short intervals. In the threading space, a newly proposed method for replacing locks, called “software transactional memory,” works by copying the original values of the data structure to be changed; despite the overhead of copying, this lock-less approach can outperform locks and scale better than they do on multiprocessors.
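The copy, compute, and commit cycle behind that idea can be sketched as follows. This is not a real software transactional memory, only an optimistic update loop in its spirit; `VersionedCell` and `atomically` are hypothetical names, and the small internal lock merely stands in for the hardware compare-and-swap a real implementation would use.

```python
import threading
from typing import Any, Callable, Tuple


class VersionedCell:
    """Holds one immutable value plus a version counter."""

    def __init__(self, value: Any) -> None:
        self._value = value
        self._version = 0
        # Assumption: this lock emulates an atomic compare-and-swap.
        self._cas_lock = threading.Lock()

    def read(self) -> Tuple[Any, int]:
        with self._cas_lock:
            return self._value, self._version

    def try_commit(self, expected_version: int, new_value: Any) -> bool:
        """Install new_value only if nobody committed since our read."""
        with self._cas_lock:
            if self._version != expected_version:
                return False
            self._value = new_value
            self._version += 1
            return True


def atomically(cell: VersionedCell, transform: Callable[[Any], Any]) -> Any:
    """Snapshot, compute on the snapshot, commit; retry on conflict."""
    while True:
        snapshot, version = cell.read()
        new_value = transform(snapshot)   # may raise; the cell is untouched
        if cell.try_commit(version, new_value):
            return new_value


counter = VersionedCell(0)


def worker() -> None:
    for _ in range(1000):
        atomically(counter, lambda n: n + 1)


threads = [threading.Thread(target=worker) for _ in range(4)]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()

print(counter.read()[0])  # 4000
```

The computation itself runs without holding any lock, and a transform that raises leaves the cell exactly as it was, which is the same "forget, don't undo" property discussed above.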
A point that I failed to mention is that Erlang assumes that software will have errors in it. The trick is to reliably handle such errors.
The traditional approach is to find all such errors, because any such errors could be ruinous to the application. However, an alternative approach is to make the application more resilient to these types of "ruinous" errors.
Posted by: Wesner Moise | June 13, 2006 at 06:14 PM
Hm, very interesting, Wes.
Ok, so the improvement here isn't fewer bugs, but more reliable software that doesn't crash when it hits bugs.
I think that's a great improvement, but what about getting fewer bugs in software? At work, the bugs in our software -- probably 99% of the time -- are coding errors. When our software fails, it's almost always due to programming error. In the Erlang way, it would just retry my component that failed, then the parent component, and so on until the whole thing restarts. That's a bad system for us! With 99% of the time being programming error, restarting components isn't going to help in the least. In the places where non-programming errors occur (such as network timeouts), well, a simple try/catch with an error message is enough. We've designed our software to retry in such scenarios as well before issuing any errors. And even the error messages provide an easy way for users to retry such operations.
All that said, this is my first time hearing of Erlang, so I'm no expert. But after reading his blog posts on it I think it's clear Erlang does not attempt to address the "fewer bugs" problem that plagues most software, or at least, plagues our real-world software.
That's one thing I'm hoping your tool will help us with, Wes. I want fewer programming errors in our software. So hurry up with that beta! :)
Posted by: Judah | June 14, 2006 at 09:34 AM
I guess the main issue is fault tolerance for certain types of applications -- in this particular case, ATM switches. So we must assume there will be bugs, but the application can resurrect itself in the face of them...
One compiler company got in trouble for leaving asserts on in released code. We don't see debug asserts now, because they are turned off, but it pretty much confirms my belief that, in reality, software that appears to be performing perfectly well may in fact be encountering frequent bugs. In such a case, the fault tolerance of the application is just as important as minimizing the number of defects in the software.
Posted by: Wesner Moise | June 14, 2006 at 11:51 PM
I suppose you're right there, at least for the ATM instance. While I agree fault tolerance is important for software, most of our time as developers here at work is spent fixing coding errors. We can pretty well keep an app running so long as things like out of memory errors aren't happening (but hey, if the system's out of memory, things beyond our control are going wrong).
Posted by: Judah | June 15, 2006 at 09:36 AM
Hi Wesner, you might be interested in Juval Löwy's MSDN Magazine article titled "Volatile Resource Managers in .NET Bring Transactions to the Common Type" (http://msdn.microsoft.com/msdnmag/issues/05/12/transactions/default.aspx).
He introduces an approach that uses transactions to keep in-memory objects in a consistent state.
Posted by: Erwyn van der Meer | June 15, 2006 at 06:07 PM
I am actually familiar with .NET transactions for volatile data, as I attended a lecture by a program manager on the transactions team in 2004.
Posted by: Wesner Moise | June 15, 2006 at 07:54 PM