Appropriate Use of Mutable Data in Functional Software Systems

Since I wrote this post, I have become aware of Rich Hickey’s discussion of State and Identity on Clojure.org. I feel he expresses the same truths in a generally clearer, sharper way than I have managed to here…

++++++++++++++++++

This post arose out of a Twitter discussion with Daniel Spiewak on twitter. I dont know whether Newton could have encoded his laws of Mechanics into 140 characters, but I need more space to share my thoughts on the (hopefully) simpler matter of: What is appropriate use of Mutable Data in Functional software systems?

Entities and Identities

I feel the answer is inter-connected with the idea of Identity represented by the data in a software system. As Eric Evans explained so well, data that has a strong identity is called an Entity.

A classic example, present in almost every business software system, is some “User Entity”, modeling a person’s account in that system. In such a system, when we speak about a particular User, conceptually we really want there to be only one such entity in existence.

The identity of an entity is normally governed by a unique & immutable primary key. Other than the primary key, everything attribute of the entity may change. A User, for example, might change their favorites list, address, name or possibly even sex.

Functional Interaction with Entities

Conceptually there’s only one entity. But actually, there’s only one “offical version”, and other possibly divergent versions, which may be promoted to official status.

Entities should be accessed via transactions (in broadest sense, including STM or concurrent data structures). A functional program makes an immutable snapshot of the entity at some time. It may derive an updated version of the entity from that, and then request that its copy be promoted as the “master” copy. Typically, this promotion would be done using optimistic concurrency, so that if another process changed the entity first, the update must retry. But other models (eg pessimistic concurr, change merging) might be used instead.

Note that

  • An executing program has no guarantee that it operates on the “latest” version of an entity.
  • After the fact, it is possible to say “at time X, the official state of the entity was thus…”, by time-stamping updates. This is important, for example, in a legal case if we wanted to know who owned some property at a particular point in time in a dispute.
  • This is either Shared State and Message Passing concurrency. We can view the “official entity version” as shared state, or as an agent to whom we send read-entity and write-entity messages.

Pretty much as Daniel put it (borrowed from Clojure, apparently) Everything is immutable except for concurrency-safe entities

The meaning of Entity Update

Something special happens when an entity’s master state is updated. This is the synchronization/convergence point between separate threads and components. It is the boundary between the functional code, which transforms some input state into some new entity states, and the imperative world of sequentially updating mutable state. It is the end of a Unit of Work, a time when a functional program says “this is not a means to an end, this is the end”. The transient becomes persisted, allowing the state in the current moment to affect and interact with other points in time.

Granularity, Scalability and Consistency

So how many entities do we need? Coarse or fine grained? A few larger, internally complex entities, or many simpler and smaller entities?

Its a matter of chosing a point on a spectrum whose extremes are

  • A single StateOfTheWorld entity whose attributes encompass all data in the system. As in a purely functional system; the program sees only its own private version of state, without any interference.
  • Every tiny piece of data is its own entity. This would be like programming only with atomic global variables. Everyone sees and shares all state without privacy.

I don’t think there’s one “right” answer. All I can offer is an outline of some design forces acting in either direction:

Fewer Entities each containing more data

  • For: data consistency within an entity. A coarse grained entity starts in a consistent state and is never externally updated. A program is free to compute changes to the entity’s state independently without risk of being disrupted by other program’s updates. The benefits of the purely functional model accrue.
  • Against: Contention if multiple processes try to simultaneously update the same entity. For an extreme example, imagine if two functional programs simulatneously computed a new StateOfTheWorld entity – they would operate in a perpetual state of fatal contention.
  • Against: Copying Overhead. Need to copy minimum of log(N)  (where N is the “size” of the entity), and often much more, of an entity’s data to update it.

More Entities each containing less data

  • For: Less copying overhead. We still need to copy minimum of log(N)  of an entity’s data to update it, but N is smaller.
  • For: Less contention when multiple processes try to simultaneously update entities, because the updates are spread across more fine-grained locks.
  • Against: data inconsistency between inter-related entities. We either have to tolerate inconsistency between entities (eg Person entity ‘Ben’ has a son ‘Otto’, but Person entity ‘Otto’ doesnt have a father ‘Ben’), or introduce an extra layer of locking over the entities (as most databases do). Note that if we inroduce locking above the entity level, our contention benefits go away, and we may be creating de-facto coarse-grain entities.

Its interesting to note that the same trade-off between consistency and scalability shows up many times elsewhere – eBay being a nice example.

Non-Global Identity

Above, I presented Entities as having a globally unique identity. Much of the time, thats a good way to think about them, but be aware it is a simplification. In fact, identity is inherently defined relative to some scope. That scope needn’t be global.

Here are 3 different scopes for the entity The number of people on earth at midnight on Dec 31, 1999.

  1. The globally true answer that an omnipotent god might know
  2. My national government’s official figure
  3. My personal estimate

Scopes may nest. Im still contemplating exactly how this affects the design of systems with mutable data.

3 Comments

  1. June 15, 2009 at 11:34 pm

    I’m not sure that there is a cut-and-dry answer to this question. Entities of this sort really don’t *require* mutability. After all, we can always just throw the entitities into a State monad. That way, everything is immutable, but the one, true entity *seems* to change over time.

    I think that it’s all a question of expressiveness. If one facet of your system models better (more maintainably, more efficiently, or really anything subjectively “better”) using mutable data, then go for it! It’s not a pure approach to functional programming, but it is a pragmatic one.

    One example of a system where I did this recently is my GLL Combinators project (http://github.com/djspiewak/gll-combinators). In a nutshell, the framework provides a parser-combinator interface to a parsing algorithm which can handle much more than conventional parser combinators (left-recursion, ambiguity, etc). The underlying algorithm *conceptually* uses a mutable dispatch stack to manage simultaneous parse paths. It is of course possible to do this in a functional way, updating the immutable stack by returning a new copy. However, it is *easiest* to simply allow this one facet of the system to be non-functional with a mutable stack.

  2. benhutchison said,

    June 16, 2009 at 10:59 pm

    @daniel “Entities of this sort really don’t *require* mutability. After all, we can always just throw the entitities into a State monad. That way, everything is immutable, but the one, true entity *seems* to change over time.”

    My (limited) understanding of State Monads is that they reorganise a functional program to /simulate/ the presence of mutable state by making some passed state implicit.

    Do they allow outside/independent threads or components to access that state? I cant see how.

    At heart, (non-simulated) mutable state is needed for independent threads & components to intercommunicate. To communicate, they need a shared “address” where they both access state. An entity’s identity defines this shared address.

    Alternate but seemingly equivalent terminology: ‘Mutable State’ becomes ‘Message’, ‘Entity’ becomes ‘Channel’.

    In my User entity example, programs wanting to interact with a particular user-identity have a defined place to start or end their processing.

  3. June 20, 2009 at 11:14 pm

    […] in Rich Hickey’s monograph “On State and Identity” (and my own independent grasping toward these ideas) are […]


Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: