Last April, I wrote about a profound design change I was making into my document-based applications, which was to make all my document data structures immutable--standard practice in some functional programming languages.
My original motivation was to make each of my model objects an expression that I could work with in a functional way and apply transformation rules to. (My first attempt a few years ago failed because I earlier assumed that mutability was required for editable documents.) The result is substantially less code and tight integration between my document and my AI framework. From the user's point of view, the application would appear more intelligent, permitting more complex document analysis and manipulations than seen in the past.
The document essentially becomes a value. The ease of changing a document in this fashion is comparable to using a TextBox control, which is simply accessed and modified by getting and setting the Text property of the TextBox control.
There is an additional cost of path copying from the root of the document tree to the changed node, but this is more than offset by a number of other direct benefits such as trivial multithreading, elimination of undo work, and other simplified algorithms (eg, easier document comparison/merging). Maintaining an undo log is actually expensive, when the document changes are high during an operation. My algorithms tend to be simpler because of the immutability variant and the access to the before and after state of the operation. My data structures became simpler, because they only need to be descriptive; anything extra is typically caching computed information. I no longer needed to store parent or other information, which enables extensive sharing.
There are some complications. I have lost object identity and each node has to be the document tree now needs to be referred to by its path, but the path can be encapsulated inside its own object and it turns out to be rarely need. Object identity presents problems when persisting, deleting, copying or pasting data (especially multiple times). If identity is really important, the node should contain an ID.
I also don't store an observers collection; I don't need to. After an entire operation is completed, I present the root view with the document before and after an operation. The view proceeds recursively downward to update its own nested views.
Intuitively, this approach makes sense. The model should be pure data, just as a string is in a TextBox ; it shouldn't containing any references to any view. The view should be able to rebuild itself based on the new and old states of the data. If a recently document node contains many children, there is a logarithmic time approach to compare before and after state of document node and find the range of children that were changed that I mentioned in a previous post.
I recently also came to the conclusion that my view data structures should also be immutable. In the course of researching the impact of changing the view, I noticed that a number of special purpose classes would no longer be needed resulting in a substantial decrease in code. I also discovered that I could dispense with a parent pointer and a reference to the model object in the view; these two could be externalized and provided by the parent. This also enables flyweights as well as virtual views, where a rendered model node is not actually backed by its own view node.
The changes are likely to improve performance because the modest costs of path copying are offset by the elimination of additional other processing (and accompanying allocations) and the impact of heavy sharing. The other processing includes layout, notifications between the model and view, and initialization/disposal code. Perhaps immutable objects could make Avalon less complex; however, Avalon might need to forgo the naturalness of property assignment.
This reminds me a blogger post that design-patterns aren't needed in functional programming, because they exist to make up for deficiencies in procedural languages.
I'm trying to grok your approach. Assuming I have a large object graph, are you saying that the graph is not connected by pointers? It is connected by IDs and then to traverse the graph the ID needs to be looked up in a dictionary?
That sounds extremely slow.
The reason I ask is that I have a system with ~30,000 nodes in a graph which is rendered on the screen. A couple of years ago, I remodeled the class hierarchy to be more like Avalon....
Posted by: RichB | September 08, 2006 at 12:18 AM
In a system with 30,000 nodes, you probably won't need any IDs.
Object identity is not actually important in referentially transparent world; it's usefulness is tied to side-effects.
Pinpointing of a node in a tree is done by passing around path information. This isn't as common either. It's main use in my application would be identifying the selected region of a document.
The only use for IDs that I can think of in a Word-like application is the cross-references feature, which is not an often used feature.
Indexing of IDs is an option. You only (de-)index the subforest of data in the document tree that has changed in the (old) new tree. You won't necessarily have to search down to the leaves--just the middle region where all the changes occured. You can do the indexing anytime, preferably after idle or, lazily, at some point when the ID is requested.
Posted by: Wesner Moise | September 08, 2006 at 03:07 AM
I have been struggling with similar concepts, though I have been looking at data/delta rather than old/new data structures.
Have you read of Huet's Zipper?
http://www.st.cs.uni-sb.de/edu/seminare/2005/advanced-fp/docs/huet-zipper.pdf
Posted by: damien morton | September 11, 2006 at 02:22 PM
Design patterns are signs of weakness in programming languages link in http://newbabe.pobox.com/~mjd/blog/prog/design-patterns.html.
This is not the original link that I was referring to in my post, but it's points are somewhat similar.
Posted by: Wesner Moise | September 21, 2006 at 10:14 PM
I haven't read the Zipper link, but I'll get back to you damien within a month.
Posted by: Wesner Moise | September 21, 2006 at 10:15 PM
As for the data/delta issue, I may indeed use that approach to conserve memory. The delta would be constructed by comparing the old and new data structure at idle, and later allow the new data to be reconstructed on demand while using less memory.
Posted by: Wesner Moise | September 21, 2006 at 10:21 PM