Microsoft released a preliminary version of immutable collections with mutable performance. It uses many of the same performance tricks that I used to build functional collections such as supporting freezable binary tree data structures. The algorithm used is similar to AVL trees in that it minimizes the difference in heights of two child trees. Red-black trees are not used probably because the number of rotations will still cause the O(lg(n)) allocations in an immutable setting and these allocations are a function of the height of the whole tree..
However, one major difference is the lack of support for constant equality comparison and fast set operations, each of which will limit the usage scenarios for the collection. Set operations in the Microsoft library are O(m lg(m+n) ), but can be implemented in O(m lg(n/m)) and even faster with constant-time equality comparisons O(d lg(n/d)), where d is the minimum of the number of differences between the two sets and m. A common problem which this helps is quickly determining all the changes between two different versions of the same collections, but I often use another solution with O(d). This is important for using immutable data structures pervasively such in a layout engine, where there are potentially many items and the interface needs to remain responsive.
At some point in the future, I plan to open-source my functional collections and other data structures as they are typically the fastest implementations available. In addition, multiple different strategies can be used to optimized for different types of usages such as full persistence, back-tracking, and ephemerality.
UPDATE: I just pulled the performance numbers from my head from a while back, but I think I made a mistake. They look wrong to me. I wrote quickly, and, on reflection, the Microsoft libraries should probably be O(m lg (n+m )), not O(n+m), for the union, intersection, etc operations. I also need to verify the bounds for mine. I’ll probably post a correction with a detailed analysis in the near future.
Thanks for sharing. I'm curious whether your statement about the lack of fast set operations is referring to how the currently implementation implements the set comparison methods (IsSupersetOf, etc.). Can you comment?
Also, I hope that when you do release your version, that you can implement the interfaces we define in this one so that it maximizes interoperability across implementations.
Posted by: Andrew Arnott | March 14, 2013 at 07:24 AM
I'll post more about this, but yes I am referring to operations such
Union, Intersect, ExclusiveUnion, Difference, IsSuperset with another ImmutableSet.
These are currently O(n) when performing the operations on two ImmutableSets. This can actually be a significant cheaper operation, given the representation.
Posted by: Wesner Moise | March 14, 2013 at 10:39 PM
I don't think that you can achieve "significantly cheaper" than O(n) time in general for the set operations you mentioned. It takes at least that time to even enumerate a set, so how could you possibly establish whether one is a superset of another *in general* without enumerating it?
Now, if the sets were sorted, I can imagine some optimizations that could be very effective in many cases, but even then, I *think* the general case is still O(n) at best. Am I wrong?
Posted by: Andrew Arnott | March 17, 2013 at 09:18 PM
About the cheaper set operations, the representation matters.
See Fast Set Operations on Treaps. http://www.cs.cmu.edu/~scandal/papers/treaps-spaa98.pdf
With an ienumerable, it will always be Omicron(n). But if you perform set operations on treaps or some other tree-like datastructure, the set operation be written recursively, so that if two immutable nodes from both sets compare equal, then the operation can be aborted early. You don't need to look at all the nodes; only the ones that are different.
The set representation that I use, treaps, produces a unique stable tree structure for the set of values it holds. With hash consing of nodes and equality canonicalization (ie, if 2 separate nodes are equal, change the reference to the younger node to point to the older node -- similar to union-find algorithm, which is effectively constant time), equality tests can effectively be made constant time on average.
Posted by: Wesner Moise | March 17, 2013 at 11:39 PM
Pardon my sloppiness - O(n log(m)) not O(n) from my first comment.
Also, the general case, actually does involve looking at every node... but if you the two trees were related to each other in some way, for example modified from a common older version of the tree. Some of the nodes in the tree will probably have been shared.
If the trees are related in other ways. The set values in each tree were compiled from similar methods or datasets, then the equivalent nodes wouldn't had the same reference, but equality normalization would make sure a second equality test would succeed immediately though the first may not.
Posted by: Wesner Moise | March 17, 2013 at 11:49 PM
Wesner,
Regarding your optimizations, I still hold that in the general case you can't be better than O(n). Optimizations can certainly bring down the average time, and in fact one earlier version of our immutable collections did in fact use the technique you mentioned to reduce equality checks between two similarly structured sets by recognizing reference equality of the tree nodes.
We removed this functionality (for now) because value equality is hard to define, which we have blog posts discussing.
Posted by: Andrew Arnott | March 21, 2013 at 09:19 AM