Since I wrote this post, I have become aware of Rich Hickey’s discussion of State and Identity on Clojure.org. I feel he expresses the same truths in a generally clearer, sharper way than I have managed to here…
This post arose out of a Twitter discussion with Daniel Spiewak. I don’t know whether Newton could have encoded his laws of mechanics into 140 characters, but I need more space to share my thoughts on the (hopefully) simpler matter of: what is the appropriate use of mutable data in functional software systems?
Entities and Identities
A classic example, present in almost every business software system, is some “User Entity”, modeling a person’s account in that system. In such a system, when we speak about a particular User, conceptually we really want there to be only one such entity in existence.
The identity of an entity is normally governed by a unique and immutable primary key. Other than the primary key, every attribute of the entity may change. A User, for example, might change their favorites list, address, name or possibly even sex.
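As a minimal sketch of this idea in Python: the `User` class, its fields and the `with_address` helper below are hypothetical names chosen for illustration, not part of any particular system. Each version of the entity is an immutable value; a "change" produces a new version that carries the same primary key.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class User:
    """One immutable version of a User entity (hypothetical fields)."""
    user_id: int            # primary key: identifies the entity, never changes
    name: str
    address: str
    favorites: tuple = ()


def with_address(user: User, new_address: str) -> User:
    """Derive a new version of the same entity; the key is carried over."""
    return replace(user, address=new_address)


v1 = User(user_id=42, name="Ben", address="1 Old Road")
v2 = with_address(v1, "2 New Street")

# v1 is untouched; v2 is a different version of the entity with the same identity.
same_identity = v1.user_id == v2.user_id
```

Nothing here stops a caller from building a version with a different key; keeping the key fixed is a convention that the update helpers are assumed to respect.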
Functional Interaction with Entities
Conceptually there’s only one entity. In practice, there is one “official version” plus other, possibly divergent versions, any of which may be promoted to official status.
Entities should be accessed via transactions (in the broadest sense, including STM or concurrent data structures). A functional program takes an immutable snapshot of the entity at some point in time. It may derive an updated version of the entity from that snapshot, and then request that its copy be promoted to the “master” copy. Typically, this promotion would be done using optimistic concurrency, so that if another process changed the entity first, the update must be retried. But other models (e.g. pessimistic concurrency or change merging) might be used instead.
- An executing program has no guarantee that it operates on the “latest” version of an entity.
- After the fact, it is possible to say “at time X, the official state of the entity was thus…”, by time-stamping updates. This matters, for example, in a legal dispute where we need to know who owned some property at a particular point in time.
- This can be viewed as either Shared State or Message Passing concurrency. We can treat the “official entity version” as shared state, or as an agent to whom we send read-entity and write-entity messages.
Pretty much as Daniel put it (borrowed from Clojure, apparently): “Everything is immutable except for concurrency-safe entities”
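The snapshot, derive, promote cycle described above can be sketched roughly as follows. This is an illustration under stated assumptions, not a definitive implementation: `Account`, `EntityRef`, `compare_and_set`, `as_of` and `update` are all hypothetical names, a plain lock stands in for an atomic compare-and-swap, and wall-clock time stamps are assumed to be good enough for the history.

```python
import threading
import time
from dataclasses import dataclass, replace
from typing import Callable, List, Optional, Tuple


@dataclass(frozen=True)
class Account:
    account_id: int
    balance: int


class EntityRef:
    """Holds the official version of one entity plus a time-stamped history."""

    def __init__(self, initial: Account) -> None:
        self._lock = threading.Lock()
        self._history: List[Tuple[float, Account]] = [(time.time(), initial)]

    def snapshot(self) -> Account:
        """Return the current official version (an immutable value)."""
        with self._lock:
            return self._history[-1][1]

    def compare_and_set(self, expected: Account, updated: Account) -> bool:
        """Promote `updated` only if `expected` is still the official version."""
        with self._lock:
            if self._history[-1][1] is not expected:
                return False          # someone else got there first
            self._history.append((time.time(), updated))
            return True

    def as_of(self, timestamp: float) -> Optional[Account]:
        """Return the version that was official at `timestamp`, if any."""
        with self._lock:
            official = None
            for stamped_at, version in self._history:
                if stamped_at <= timestamp:
                    official = version
            return official


def update(ref: EntityRef, transform: Callable[[Account], Account]) -> Account:
    """Optimistic update: compute off a snapshot, retry if promotion fails."""
    while True:
        current = ref.snapshot()
        candidate = transform(current)     # pure function, no lock held
        if ref.compare_and_set(current, candidate):
            return candidate


# Usage: eight threads each make 100 one-unit deposits to the same entity.
ref = EntityRef(Account(account_id=1, balance=0))


def deposit_many() -> None:
    for _ in range(100):
        update(ref, lambda acct: replace(acct, balance=acct.balance + 1))


threads = [threading.Thread(target=deposit_many) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()

final_balance = ref.snapshot().balance
```

The lock is held only for the brief read or promotion step, never while `transform` runs, so the functional part of the program works on its private snapshot without interference. Swapping the retry loop for a blocking lock around the whole update would give the pessimistic variant.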
The meaning of Entity Update
Something special happens when an entity’s master state is updated. This is the synchronization/convergence point between separate threads and components. It is the boundary between the functional code, which transforms some input state into some new entity states, and the imperative world of sequentially updating mutable state. It is the end of a Unit of Work, a time when a functional program says “this is not a means to an end, this is the end”. The transient becomes persisted, allowing the state in the current moment to affect and interact with other points in time.
Granularity, Scalability and Consistency
So how many entities do we need? Coarse or fine grained? A few larger, internally complex entities, or many simpler and smaller entities?
It’s a matter of choosing a point on a spectrum whose extremes are:
- A single StateOfTheWorld entity whose attributes encompass all data in the system. As in a purely functional system, the program sees only its own private version of state, without any interference.
- Every tiny piece of data is its own entity. This would be like programming only with atomic global variables. Everyone sees and shares all state without privacy.
I don’t think there’s one “right” answer. All I can offer is an outline of some design forces acting in either direction:
Fewer Entities each containing more data
- For: data consistency within an entity. A coarse-grained entity snapshot starts in a consistent state and is never externally updated. A program is free to compute changes to the entity’s state independently, without risk of being disrupted by other programs’ updates. The benefits of the purely functional model accrue.
- Against: Contention if multiple processes try to simultaneously update the same entity. For an extreme example, imagine two functional programs simultaneously computing a new StateOfTheWorld entity: they would operate in a perpetual state of fatal contention.
- Against: Copying overhead. Updating an entity means copying at least log(N) of its data (where N is the “size” of the entity), and often much more.
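To make the log(N) copying figure concrete, here is a small sketch that assumes the entity’s data is stored as a balanced, immutable binary search tree. The `Node`, `build`, `insert` and `all_nodes` names are hypothetical, and other persistent structures would give different constants. An update rebuilds only the nodes on the path from the root to the changed key; every other subtree is shared between the old and new versions.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Node:
    key: int
    value: str
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def build(keys: List[int]) -> Optional[Node]:
    """Build a balanced tree from a sorted list of keys."""
    if not keys:
        return None
    mid = len(keys) // 2
    return Node(keys[mid], f"v{keys[mid]}", build(keys[:mid]), build(keys[mid + 1:]))


def insert(node: Optional[Node], key: int, value: str) -> Node:
    """Return a new tree with `key` set to `value`, copying only the search path."""
    if node is None:
        return Node(key, value)
    if key < node.key:
        return Node(node.key, node.value, insert(node.left, key, value), node.right)
    if key > node.key:
        return Node(node.key, node.value, node.left, insert(node.right, key, value))
    return Node(key, value, node.left, node.right)


def all_nodes(node: Optional[Node]) -> List[Node]:
    if node is None:
        return []
    return [node] + all_nodes(node.left) + all_nodes(node.right)


old = build(list(range(15)))          # 15 keys, 4 levels deep
new = insert(old, 14, "changed")

# Count the nodes of the new version that do not exist in the old one.
old_ids = {id(n) for n in all_nodes(old)}
copied = [n for n in all_nodes(new) if id(n) not in old_ids]
shares_left_half = new.left is old.left
```

With 15 keys the update copies the 4 nodes on the path to key 14 and shares the other 11. The “often much more” case arises when the data is not tree-shaped, or when an update touches many keys at once.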
More Entities each containing less data
- For: Less copying overhead. We still need to copy at least log(N) of an entity’s data to update it, but N is smaller.
- For: Less contention when multiple processes try to simultaneously update entities, because the updates are spread across more fine-grained locks.
- Against: data inconsistency between inter-related entities. We either have to tolerate inconsistency between entities (e.g. Person entity ‘Ben’ has a son ‘Otto’, but Person entity ‘Otto’ doesn’t have a father ‘Ben’), or introduce an extra layer of locking over the entities (as most databases do). Note that if we introduce locking above the entity level, our contention benefits go away, and we may be creating de facto coarse-grained entities.
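The Ben and Otto inconsistency can be shown in a few lines. In this sketch the `Person` and `Family` classes, the `official` table and the `is_consistent` check are hypothetical, and a plain dictionary stands in for the store of official versions. With fine-grained entities the relationship takes two separate promotions, and an observer between them sees a half-made link; with one coarse-grained entity the link appears in a single step.

```python
from dataclasses import dataclass, replace
from typing import Dict, Optional


@dataclass(frozen=True)
class Person:
    name: str
    father: Optional[str] = None
    children: tuple = ()


def is_consistent(people: Dict[str, Person]) -> bool:
    """Every parent-to-child link must have a matching child-to-father link."""
    return all(
        people[child].father == parent.name
        for parent in people.values()
        for child in parent.children
    )


# Fine-grained: each Person is its own entity, promoted independently.
official: Dict[str, Person] = {"Ben": Person("Ben"), "Otto": Person("Otto")}

official["Ben"] = replace(official["Ben"], children=("Otto",))
consistent_between_updates = is_consistent(official)   # observed mid-way

official["Otto"] = replace(official["Otto"], father="Ben")
consistent_after_both = is_consistent(official)


# Coarse-grained: the whole family is one entity, so the link is one update.
@dataclass(frozen=True)
class Family:
    members: tuple   # tuple of Person values


def link(family: Family, father: str, child: str) -> Family:
    """Return a new Family version with both sides of the link set together."""
    def updated(p: Person) -> Person:
        if p.name == father:
            return replace(p, children=p.children + (child,))
        if p.name == child:
            return replace(p, father=father)
        return p
    return Family(tuple(updated(p) for p in family.members))


def by_name(family: Family) -> Dict[str, Person]:
    return {p.name: p for p in family.members}


family_v1 = Family((Person("Ben"), Person("Otto")))
family_v2 = link(family_v1, "Ben", "Otto")
```

Every version of `Family` is internally consistent, but any two processes that touch any family member now contend for the same entity.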
It’s interesting to note that the same trade-off between consistency and scalability shows up in many other places, eBay being a nice example.
Above, I presented Entities as having a globally unique identity. Much of the time, that’s a good way to think about them, but be aware it is a simplification. In fact, identity is inherently defined relative to some scope. That scope needn’t be global.
Here are 3 different scopes for the entity “The number of people on earth at midnight on Dec 31, 1999”:
- The globally true answer that an omniscient god might know
- My national government’s official figure
- My personal estimate
Scopes may nest. I’m still contemplating exactly how this affects the design of systems with mutable data.