« April 2009 | Main | November 2010 »

1 posts from October 2010

October 30, 2010

Immutable Collections

I have been successfully experimenting with functional programming techniques, most importantly immutable data structures, in a sophisticated user interface application project.

Admittedly, I often use imperative algorithms and data structures inside transient operations to reduce memory allocations and GC pressure while generally reserving immutability for the long-living data structures. Imperative algorithms are fine if they do not pierce the functional façade of the method call or class interface. The advantages of functional data structures are their versatility, which include persistence (version support), referential transparency, optimizations, and cheap copies and comparison tests.

In yesterday’s PDC panel with language experts, Anders Hejlsberg expressed skepticism regarding the performance of functional collections. He alluded to Clojure, a highly optimized functional language, in which the array implementation results in a 2X or 3X performance degradation. A discussion I had with two Microsoft developers at a Nerd Dinner similarly resulted in both of them quickly asserting that functional arrays had to run with logarithmic overheard and dismissing their suitability for real world applications.

There is still a widespread belief that using functional data structures will introduce substantial overhead. It’s known that imperative program can be rewritten with a logarithmic slowdown in a functional style. A well-cited paper showed that, under a couple assumptions, that for some contrived algorithms, that this logarithm slowdown is the best one can hope for in a conversion to purely functional algorithm. However, the paper is more of a theoretical than of a practical significance.

It’s generally accepted that any impure algorithm can be turned into a pure algorithm through the use of functional maps, which have logarithm access and update times using balanced binary trees. What is often overlooked, however, functional maps can be made to run in constant access and update times in ephemeral usage, if internals of functional data structures are mutated behind the functional façade.

An ephemeral data structure is one in which only the latest version can be accessed, as opposed to a persistent data structure in which any prior version is also accessible. If a prior versions are also updatable, the data structure is fully persistent, otherwise, partially persistent.  Imperative data structures are naturally ephemeral, since mutation destroys the previous states.

There have been a number of ways to produce efficient functional arrays, most of which uses mutation behind the scenes.

Clojure uses AMTs (array mapped tries) which provides O(log32 n) access and updates for its array-like vector data structure. The implementation uses mutation internally and is optimized for append operations. According to Rich Hickey, this is practically constant-time.

There are actually two serious candidates that are fully persistent and guarantee constant time access and updates in the ephemeral case.

1) Trailer arrays, also known as shallow binding. A trailer array data structure is an  array, conceptually coupled with an undo stack. Accesses and updates to a single version of the array are as fast as imperative arrays. To access or update a different version of the array, there is a cost per context switch as the array is rerooted to hold the image of the new version. This cost is proportional to the number of changes made between the previous and new version. This cost is not really a disadvantage as imperative arrays only offers a single version.

Below is a simple, unoptimized implementation of a trailer array, supporting full persistence, and ideal constant-time running time when used in ephemeral or backtracking. (The Set operation requires a Reroot() operation before return to get worst case constant time as opposed to amortized constant time when used in an ephemeral setting. With the Reroot operation, the GC reachable memory overhead is also minimized. In the ephemeral case, just the array node and the last version object is reachable.)

   1:      public class PArray<T> where T : IEquatable<T>
   2:      {
   3:          Change _change;
   4:          PArray<T> _previous;
   5:   
   6:          public PArray(T[] array)
   7:              : this(new ArrayNode {Array = array}, null)
   8:          {
   9:          }
  10:   
  11:          private PArray(Change change, PArray<T> previous)
  12:          {
  13:              _change = change;
  14:              _previous = previous;
  15:          }
  16:   
  17:          public PArray<T> Set(int index, T element)
  18:          {
  19:              return new PArray<T>(new SetNode { Index = index, Value = element }, this);
  20:          }
  21:   
  22:          public T Get(int index)
  23:          {
  24:              return Reroot().Array[index];
  25:          }
  26:          
  27:          private ArrayNode Reroot()
  28:          {
  29:              var a = _change as ArrayNode;
  30:              if (a != null)
  31:                  return a;
  32:              
  33:              a = _previous.Reroot();
  34:              var diff = (SetNode)_change;
  35:              var v = a.Array[diff.Index];
  36:              a.Array[diff.Index] = diff.Value;
  37:              
  38:              // Alternatively, for concurrency:
  39:              // _previous._change = new SetNode { Index = diff.Index, Value = v, Previous = this }
  40:              diff.Value = v;
  41:              _previous._change = diff;
  42:              _previous._previous = this;
  43:              
  44:              _previous = null;
  45:              _change = a;
  46:              return a;
  47:          }
  48:   
  49:          public abstract class Change
  50:          {
  51:          }
  52:   
  53:          private class ArrayNode : Change
  54:          {
  55:              public T[] Array;
  56:          }
  57:          
  58:          private class SetNode : Change
  59:          {
  60:              public int Index;
  61:              public T Value;
  62:          }
  63:      }

2) Fat nodes. This approach is explained in A New Method For Functional Arrays and is essentially array containing fat nodes, housing a version-keyed tree of values. These arrays are well-balanced in that they are slightly slower than ephemeral arrays, but have no switching costs between versions. They do consume more memory, but are better suited for fully persistent scenarios involving simultaneous operations with multiple versions.

The techniques, trailer arrays and fat nodes, are generally applicable to any imperative data structure that one would want to make persistent and observationally immutable.

There are some concerns when using mutation behind the scenes:

1) Meld operations surges as merges may not be easily supported.

2) Concurrency is no longer free and will need to implemented in.

3) Data structures may no longer be compact. Some dead versions may still be GC reachable.