What's Missing In .NET and Other Collection Libraries
Collections are probably one of the most important components of any library. The framework's collection classes, as of Everett, are somewhat weak, especially compared with Java and STL. With the recent work with collections in Whidbey, much of the gap in functionality with Java will be reduced but not eliminated, although the integration and performance of generics collection should be superior to Java's collection libraries. Some new collections, like linked lists, will be introduced, but I haven't seen any indications of sets or other tree-based data structures from Java being introduced.
STL.NET, a .NET port of STL, will also be introduced by the VC++ teams, and indications are that the library will be usable by other managed languages; unfortunately, the style and naming conventions of that library are the same as that of native STL and will be incompatible with the framework's own style and naming guidelines.
All of these libraries, including STL, are missing an important type of collection--an indexable list with fast (logarithmic or better) insert and deletion times. These libraries simply offer a tradeoff--either the indexing time or the insertion/deletion time must take linear time. Many applications will simply stick with a simple array or ArrayList and accept the linear time performance. The result is the widespread phenomenon where a user will supply a large, but reasonable, amount of data into an application, and that application will soon slow to a crawl.
Many commercial applications will try to be a little more sophisticated by creating a list of lists, which still delivers linear time performance, but improves insertion time substantially by a constant factor. Such an approach still doesn't scale well after a certain point. For random access of elements in large sequential data, true industrial-strength applications will use a logarithmic-time list implementation, which is usually based on a tree or possibly a skip-list.
I wrote a general-purpose list, based on a balanced tree implementation. The list is broken up into chunks, each of which are of length 1 to some maximum specified chunk size. The chunks are all stored as nodes in the binary tree. In addition to the chunks, the nodes also contain a relative index value for keeping track of each node.
Insertion and deletion time are O( lg(n/chunkSize) ), which is practically instantaneous. Indexing time, a little slower, is also logarithmic O( lg(n/chunkSize) ). However, using an enumerator to sequentially access a range of elements via a tree walk yields constant time access per element on average. This leads to the interesting conclusion, that foreach is actually faster than indexing inside a for loop. Foreach takes O(m) on average to access m items, while for takes O( m lg n ) for the same set of items.
This implementation took me less than a week; in addition, I was able to add additional capabilities such as the ability to compactly store a run of elements, making this list also suitable for a run array or a sparse array. With trees, a lot of nice features become possible.
For the small amount of time it took me to write this implementation, I am surprised we don't have such list implementations in any library. It seems to me that a general-purpose scalable list class should be in all the standard collection libraries. It will probably be the case five years from now, but, for now, I am considering publishing my own implementation, as soon as I have converted it to use Whidbey generics and have worked with it for a few months to be sure all the bugs are out.
I'm with you!!! one of the things that I miss the most in .NET (I was/am a C++ developer) is the rich sets of collections. So please (PLEEEEEAAASEEE!!!!!!!) if you can talk the whidbey team to include the same rich collections that we have in STL, I wil be forever (and beyond) grateful.
Posted by: Juan Felipe Machado | February 25, 2004 at 06:04 AM
Sounds like a great component to release. It'd be great if you could release it for v1.1 too...
Posted by: Gary Pupurs | February 26, 2004 at 08:13 PM
This is in my top 5 "what's wrong with .NET" list.
As a former (and somewhat current) Java developer, I can say that the .NET folks don't know what they are missing in terms of productive development of efficient apps.
I was appalled at the lack of linked lists and true hash tables (the .NET pattern of collision handling by remap instead of using buckets breaks down badly)
Compound this with the lack of good documentation on collections (algorithms, O(n) notations, exceptions to performance, boundary conditions) and all in all it's a very disappointing showing.
Hope I'll be happier with Whidbey, or that the community comes through on this one...
(it's a shame there's no need for my fancy linked list though heh heh)
Posted by: David Goldstein | February 29, 2004 at 07:32 PM
While containers are an important part of the stl, an equally important part are the algorithms. Do you know if this STL.NET will include algorithms to work on top of the containers? Also, will the STL.NET outline the performance requirements of the different algorithms and containers like the STL does?
Posted by: Ben | March 03, 2004 at 10:22 AM
STL.NET is essentially identical to STL, except for being a managed library with managed types. It is implemented with a combination of generics and templates.
It should feel and perform the same in C++, but there will be reduced functionality for other languages which only support generics.
I believe the new System.Collection.Generics library may actually include new algorithms that work on top of containers.
Posted by: Wesner Moise | March 03, 2004 at 01:47 PM