Persistent Data Structure
A persistent data structure is referred to a data structure whose underlying structure cannot be modified. Every modification will result in a new instance of the type with the updated values. In another word, such a data structure is immutable.
Why do we care?
In short, it’s all about parallelism. Google and read why we want to write code running in parallel.
Because immutable objects cannot be modified, it forces each thread of the program uses its own copy of data, which facilitates safe, correct, and concurrent code support parallelism.
Shared and mutable data structures, on the other hand, require synchronization, which is accomplished by locking the object, and normally has the following challenges:
- very hard to reason about what will happen in run time;
- very hard to debug and write unit test;
- potential race condition;
- possible deadlocks;
What about Readonly Collection?
In C#, Readonly Collection, is a wrapper for the underlying collection type. For example, ReadOnlyCollection<T>, which is a wrapper of List<T>. The ReadOnlyCollection will prevent being modified; however, if the underlying collection is modified, the ReadOnlyCollection is changed too. The following code snippet shows how to update a ReadOnlyCollection (run this in LinqPad):
List dinosaurs = new List();
ReadOnlyCollection readOnlyDinosaurs = new ReadOnlyCollection(dinosaurs);
// this so-called readonly collection is now modified
In contrast, ImmutableList is a Persistent Data Structure. It will return a new instance of a list with the updated values, but the original list is intact.The following code snippet shows what the difference is:
List dinosaurs = new List();
ImmutableList immutableDinosaurs = dinosaurs.ToImmutableList();
var modifiedImmutableDinosaurs = immutableDinosaurs.Add("T-rex");
What about performance and memory space?
Since updating an immutable list returns a new list, does it mean it will take more memory space? Not really. To understand this, the best way is going through the implementation of ImmutableList at here: https://github.com/dotnet/corefx/tree/master/src/System.Collections.Immutable/src/System/Collections/Immutable
Internally, ImmutableList is a LinkedList with a strategy called “Shared Structure” or “Structure Sharing”. Basically, it means, all (Immutable) copies of the original list share some nodes. In the above example, immutableDinosaurs and modifiedImmutableDinosaurs share 4 nodes, but modifiedImmutableDinosaurs has one additional node. The variable of an immutable data object is just a reference of the head of the linked list. Therefore, there is really no significant extract space allocation.
How can it be thread-safe if we can actually update it?
You may notice that multiple threads can access the same immutable object. Because it always stays the same, “updating” it from multiple threads is safe with requires no lock. After all, all threads will simply perform just read operation. However, when a new list is constructed, a technical called Compare-and-Swap (CAS) is used. This operation is atomic and the outcome is either done or not done. CAS is a machine level instruction that requires no lock using a shareable object. Google for more details if interested.
Other Thread-Safe Data Structure?
.NET has many other data structure support thread-safe processing, including Concurrent Collection (ConcurrentBag, ConcurrentDictionary, etc), which is designed to work better in concurrent computation.
Concurrency vs Parallelism
To understand those 2 concepts, we first should understand what sequential computation is. Actually, I believe everyone can easily understand what a sequential computation is:
Concurrency, on the other hand, is referred to 2 or more tasks are executed at the same time. Using Barista taking orders at a coffee shop as an example:
The barista can perform grinding bean and boiling water those 2 tasks concurrently, and then move to next task(s). Those 2 tasks are independent to each other.
Finally, Parallelism can be seen there are more than 2 baristas, so we can server more customers in parallel.
In sum, using immutability offers
- lock-free and thread-safe implementation;
- enabling shared structure to minimize GC effort, resulting higher performance in parallel computation;
C# has mutability as default behavior, but interestingly, the implementation of string, is actually immutable.
F# on the other hand, is default to immutability, meaning, every variable is immutable. Changing it will result in a new copy of it. There is no null. There is no need to lock.